Chapter 1


Introduction  and  Basic  Definitions 


1.1  Overview 

Hidden Markov Models (HMMs) are among the more popular and successful techniques
for pattern recognition in use today. For example, experiments in speech recognition
have shown that HMMs can be useful tools for modelling the variability of human
speech ([juang91], [lee88], [rabiner86], [bahl88]). Hidden Markov Models have also
been used in computational linguistics [kupiec90], in document recognition [kopec91],
and in other situations where intrinsic statistical variability in data must be accounted
for in order to perform pattern recognition. HMMs are constructed by considering
stochastic processes that are probabilistic functions of Markov Chains. The underlying
Markov Chain is never directly measured, hence the name Hidden Markov
Model.^1 An example of an HMM could be the artificial economy of Figure 1.1. The
economy in this figure transitions probabilistically between the states Depressed, Normal,
and Elevated. The average stock price in each of these states is a probabilistic
function of the state. Typically, pattern recognition using Hidden Markov Models is
carried out by building HMM source models for stochastic sequences of observations.

^1 Hidden Markov Models are also closely related to Probabilistic Automata. Appendix A discusses
the connections in detail and shows that, with appropriate definitions of equivalence, HMMs can be
considered a subclass of probabilistic automata.
Figure 1.1: A Hidden Markov Model Economy

A given sequence is classified as arising from the source whose HMM has the
highest a posteriori likelihood of producing it. Despite their popularity and relative
success, HMMs are not well understood on a fundamental level. This thesis attempts
to lay part of a foundation for a more principled use of Hidden Markov Models in
pattern recognition. In the next section I will briefly describe the history of functions
of Markov Chains as relevant to this thesis. I will then proceed to discuss the
motivations underlying this research and the major questions that I address here.
Principally, these questions involve the development of fast algorithms for deciding
the equivalence of HMMs and reducing them to minimal canonical forms. The
chapter will conclude by introducing the basic definitions and notation necessary for
understanding the rest of the thesis.


1.2  Historical  Overview 

As mentioned in the previous section, Hidden Markov Models are probabilistic functions
of Markov Chains, of which the artificial economy in Figure 1.1 is an example.
The concept of a function of a Markov Chain is quite old, and the questions answered
in this thesis seem to have been first posed by Blackwell and Koopmans in 1957 in the
context of related deterministic functions of Markov Chains [blackwell57]. This work
sought to find necessary and sufficient conditions that would "identify" equivalent
deterministic functions of Markov Chains, and studied the question in some special
cases. Gilbert, in 1959, provided a more general, but still partial, answer to this
question of "identifiability" of deterministic functions of Markov Chains [gilbert59].
The topic was studied further by several authors who elucidated various aspects
of the problem ([burke58], [dharma63a], [dharma63b], [dharma68], [bodreau68],
[rosenblatt71]). Functions of Markov Chains were also studied under the rubric "Grouped
Markov Chains", and necessary and sufficient conditions were established for equivalence
of a Grouped Chain to a traditional Markov Chain ([kemeney65], [iosifescu80]).
Interest in functions of Markov Chains, and particularly in probabilistic functions of
Markov Chains, has been revived recently because of their successful applications in
speech recognition. The most effective recognizers in use today employ a network
of HMMs as their basic technology for identifying the words in a stream of spoken
language ([lee88], [levinson83]). Typically, the HMMs are used as probabilistic source
models which are used to compute the posterior probabilities of a word, given a model.
This thesis arises from an attempt to build part of a foundation for the principled use
of HMMs in pattern recognition applications. We provide a complete characterization
of equivalent HMMs and give an algorithm for reducing HMMs to minimal canonical
representations. Some work on the subject of equivalent functions of Markov
Chains has been done concurrently with this thesis in Japan [ito92]. However, Ito et
al. work with less general deterministic functions of Markov Chains, and find an
algorithm for checking equivalent models that takes time exponential in the size of
the chain. (In this thesis, we achieve polynomial time algorithms in the context of
more general probabilistic functions of Markov Chains.) Some work has been done by
Y. Kamp on the subject of reduction of states in HMMs [kamp85]. However, Kamp's
work only considers the very limited case of reducing pairs of states with identical
output distributions, in left-to-right models. There has also been some recent work
in the theory of Probabilistic Automata (PA) which uses methods similar to ours
to study equivalence of PAs [tzeng]. Tzeng cites the work of Azaria Paz [paz71] and
others as achieving the previous best results for testing equivalence of Probabilistic
Automata.^2 Appendix A will define Probabilistic Automata and discuss their
connections with HMMs. In Chapter 3 we will define Generalized Markov Models (GMMs), a new
class of models for stochastic processes that are derived by relaxing the positivity
constraint on some of the parameters of HMMs. The idea of defining GMMs arises
from work by L. Niles, who studied the relationship between stochastic pattern
classifiers and "neural" network schemes [niles90]. Niles demonstrated that relaxing the
positivity constraint on HMM parameters had a beneficial effect on the performance
of speech classifiers. He proceeded to interpret the negative weights as inhibitory
connections in a network formulation of HMMs.

1.3  The  Major  Questions 

Despite their popularity and relative success, HMMs are not well understood on
a theoretical level. If we wish to apply these models in a principled manner to
Bayesian classification, we should know that HMMs are able to accurately represent
the class-conditional stochastic processes appropriate to the classification domain.
Unfortunately, we do not understand in detail the class of processes that can be
modelled exactly by Hidden Markov Models. Even worse, we do not know how many
states an HMM would need in order to approximate a given stochastic process to
a given degree of accuracy. We do not even have a good grasp of precisely what
characteristics of a stochastic process are difficult to model using HMMs.^3 There is
a wide body of empirical knowledge that practitioners of Hidden Markov Modelling
have built up, but I feel that the collection of useful heuristics and rules of thumb
it represents is not a good foundation for the principled use of HMMs in pattern
recognition. This thesis arises from some investigations into the properties of HMMs
that are important for their use as pattern recognizers.

^2 Paz's results placed the problem of deciding equivalence of Probabilistic Automata in the
complexity class co-NP. It is well known that equivalence of deterministic automata is in P and
equivalence of nondeterministic automata is PSPACE-complete. Tzeng decides equivalence of PAs in
polynomial time using methods similar to ours.

1.3.1  Intuitions  and  Directions 

The basic intuition underlying a comparison of the relative expressiveness of Hidden
Markov Models and the well-understood Markov Chains suggests that HMMs should
be more "powerful", since we can store information concerning the past in probability
distributions that are induced over the hidden states. This stored information permits
the output of a finite-state HMM to be conditioned on the entire past history
of outputs. This is in contrast with a finite-state Markov Chain, which can be
conditioned only on a finite history. On the other hand, the amount of information about
the output of an HMM at time $t$, given by the output at time $(t - n)$, should drop off
with $n$. It can also be seen that there are many HMMs that are models of exactly the
same process, implying that there can be many redundant degrees of freedom in a
Hidden Markov Model. This leads to the auxiliary problem of trying to characterize
Hidden Markov Models in terms of equivalence classes that are models of precisely
the same process. Such an endeavour would give some insight into the features of an
HMM that contribute to its expressiveness as a model for stochastic processes. Given
a characterization in terms of equivalence classes, every HMM could be reduced to
a canonical form which would essentially be the smallest member of its class. This
is a prerequisite for the problem of characterizing the processes modelled by HMMs,
since we should, at the very least, be able to say what makes one model different from
another. Furthermore, the canonical representation of an HMM would presumably
remove many of the superfluous features of the model that do not contribute to its
intrinsic expressiveness. Therefore, we could more easily understand the structure and
properties of Hidden Markov Models by studying their canonical representations. In
addition, a minimal representation for a stochastic process within the HMM framework
is an abstract measure of the complexity of the process. This idea has some
interesting connections with Minimum Description Length principles and ideas about
Kolmogorov Complexity. However, these connections are not explored in this thesis.

^3 We can, however, reach some conclusions quickly by considering analogous questions for Finite
Automata. For example, it should not be possible to build a finite-state HMM that accurately
models the long-term statistics of a source that emits palindromes with high probability.

1.3.2  Contributions  of  This  Thesis 

Keeping the goals described above in mind, I have developed quick methods to decide
equivalence of Hidden Markov Models and reduce them to minimal canonical forms.
On the way, I introduce a convenient generalization of Hidden Markov Models that
relaxes some of the constraints imposed on HMMs by their probabilistic interpretation.
These Generalized Markov Models (GMMs), defined in Chapter 3, preserve the
essential properties of HMMs that make them convenient pattern classifiers. They arise
from the point of view that having a probabilistic interpretation of HMM parameters
is peripheral to the goal of designing convenient and parsimonious representations
for stochastic processes. The reduction algorithm for Hidden Markov Models will,
in fact, reduce HMMs to their minimal equivalent GMMs. Towards the end of the
thesis, I will also briefly consider the problem of approximate equivalence of models.
This is important because, in any practical situation, HMM parameters are estimated
from data and are subject to statistical variability. I have listed the major results of
this thesis below. I have developed:

1. A polynomial time algorithm to check equivalence of prior probability
distributions on a given model.

2. A polynomial time algorithm to check equivalence of HMMs with fixed
priors.

3. A polynomial time algorithm to check the equivalence of HMMs for
arbitrary priors.

4. A definition for a new type of classifier, a Generalized Markov Model
(GMM), that is derived by relaxing the positivity constraint on HMM
parameters. We will give a detailed description of the relationship between
HMMs and GMMs.

5. A polynomial time algorithm to canonicalize a GMM by reducing it to a
minimal equivalent model that is essentially unique. The minimal
representation, when appropriately restricted, will be a minimal representation
of HMMs in the GMM framework. The result will also involve a
characterization of the essential degree of expressiveness of a GMM.

We will see that all these results are easy to achieve when cast in the language of
linear vector spaces. The problems discussed here have remained open for quite a
long time because they were not cast in the right language for easy solution.


1.4  Basic  Definitions 

In  this  section  I  will  define  Hidden  Markov  Models  formally,  and  I  will  introduce  the 
basic  notation  and  concepts  that  will  be  useful  in  later  chapters. 



1.4.1  Hidden  Markov  Models 

Definition  1.1  (Hidden  Markov  Model) 

A Hidden Markov Model can be defined as a quadruple $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ where
$s_i \in \mathcal{S}$ are the states of the model and $o_j \in \mathcal{O}$ are the outputs. Taking $s(t)$ to be
the state and $o(t)$ the output of $\mathcal{M}$ at time $t$, we also define the transition matrix
$A$ and the output matrix $B$ so that $A_{ij} = \Pr(s(t) = s_i \mid s(t-1) = s_j)$ and
$B_{ij} = \Pr(o(t) = o_i \mid s(t) = s_j)$. In this thesis we only consider HMMs with discrete and finite
state and output sets. So, for future use we also let $n = |\mathcal{S}|$ and $k = |\mathcal{O}|$.
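The constraints implied by Definition 1.1 can be checked mechanically. The following Python sketch is an illustration only: the names `is_column_stochastic` and `make_hmm` and the example numbers are assumptions introduced here, not part of the thesis. It stores $A$ and $B$ with the index convention of the definition, so every column of $A$ and every column of $B$ must be a probability vector.

```python
def is_column_stochastic(matrix, tol=1e-9):
    """Return True if all entries are >= 0 and every column sums to 1."""
    rows, cols = len(matrix), len(matrix[0])
    for j in range(cols):
        column = [matrix[i][j] for i in range(rows)]
        if any(v < 0 for v in column) or abs(sum(column) - 1.0) > tol:
            return False
    return True


def make_hmm(A, B):
    """Bundle (A, B) into a dict after checking the constraints of Definition 1.1.

    A[i][j] = Pr(s(t) = s_i | s(t-1) = s_j)   (n x n)
    B[i][j] = Pr(o(t) = o_i | s(t) = s_j)     (k x n)
    """
    n = len(A)
    if any(len(row) != n for row in A):
        raise ValueError("A must be square (n x n)")
    if any(len(row) != n for row in B):
        raise ValueError("B must have one column per state (k x n)")
    if not (is_column_stochastic(A) and is_column_stochastic(B)):
        raise ValueError("columns of A and B must be probability vectors")
    return {"A": A, "B": B, "n": n, "k": len(B)}


# Hypothetical two-state, two-output model used only for illustration.
A = [[0.9, 0.2],
     [0.1, 0.8]]
B = [[0.5, 0.1],
     [0.5, 0.9]]
hmm = make_hmm(A, B)
print(hmm["n"], hmm["k"])
```

Note that the columns, not the rows, are normalized here. This follows from the definition of $A_{ij}$ as the probability of entering $s_i$ from $s_j$.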

In  order  for  an  HMM  to  model  a  stochastic  process,  it  must  be  initialized  by 
specifying  an  initial  probability  distribution  over  states.  The  model  then  transitions 
probabilistically  between  its  states  based  on  the  parameters  of  its  transition  matrix 
and  emits  symbols  based  on  the  probabilities  in  its  output  matrix.  Therefore,  we 
define  an  Initialized  Hidden  Markov  Model  as  follows: 

Definition  1.2  (Initialized  Hidden  Markov  Model) 

An Initialized Hidden Markov Model is a quintuple $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B, \vec{p})$. The symbols
$\mathcal{S}$, $\mathcal{O}$, $A$, and $B$ represent the same quantities as they do in Definition 1.1. $\vec{p}$ is a
probability vector such that $p_i$ is the probability that the model starts in state $s_i$ at
time $t = 0$. We take $\vec{p}$ to be a column vector. Having fixed the priors, the model
may be evolved according to the probabilities encoded in the transition matrix $A$ and
the output matrix $B$. If $\mathcal{M}$ is a given Hidden Markov Model, we will use the notation
$\mathcal{M}(\vec{p})$ to denote the HMM $\mathcal{M}$ initialized by the prior $\vec{p}$.

Figure 1.2 shows an example of a Hidden Markov Model as defined above. Our
definition is slightly different from the standard definition of HMMs, which actually
corresponds to our Initialized Hidden Markov Models. In our formulation, an HMM
defines a class of stochastic processes corresponding to different settings of the prior
probabilities on the states. An Initialized Hidden Markov Model is a specific process
derived by fixing a prior on an HMM.

Figure 1.2: A Hidden Markov Model
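To make the distinction between an HMM and an Initialized HMM concrete, the sketch below draws an output string from a model once a prior has been fixed. It is a minimal illustration under stated assumptions: the function name `sample_outputs` and the numerical values are hypothetical, and the first emitted symbol is assumed to come from a state drawn directly from the prior, which is the convention implied by the operator calculus of Section 1.5.

```python
import random


def sample_outputs(A, B, p, length, rng):
    """Draw `length` output symbols (as indices) from the HMM (A, B) with prior p.

    A[i][j] = Pr(next state i | current state j)
    B[o][j] = Pr(output o | current state j)
    """
    n, k = len(A), len(B)
    state = rng.choices(range(n), weights=p)[0]  # assumption: s(1) ~ p
    outputs = []
    for _ in range(length):
        # Emit a symbol from the current (hidden) state.
        emit_weights = [B[o][state] for o in range(k)]
        outputs.append(rng.choices(range(k), weights=emit_weights)[0])
        # Move to the next state using column `state` of A.
        next_weights = [A[i][state] for i in range(n)]
        state = rng.choices(range(n), weights=next_weights)[0]
    return outputs


# Hypothetical example model and prior.
A = [[0.9, 0.2],
     [0.1, 0.8]]
B = [[0.5, 0.1],
     [0.5, 0.9]]
p = [0.6, 0.4]
sample = sample_outputs(A, B, p, 10, random.Random(0))
print(sample)
```

Changing `p` while keeping `A` and `B` fixed selects a different member of the class of processes that the HMM defines.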

1.4.2  Variations  on  The  Theme 

It should be pointed out that many variants of Hidden Markov Models appear in
the literature. Authors have frequently used models in which the outputs are
associated with the transitions rather than the states. It can be shown quite easily
that it is always possible to convert such a model into an equivalent HMM
according to our definition.^4 However, for somewhat technical reasons, converting from a
hidden-transition HMM to a hidden-state HMM requires, in general, an increase in
the number of states. The literature also frequently uses models with continuously
varying observables. These are easily defined by replacing the "output matrix" $B$ by
continuous output densities. HMMs with Gaussian output densities are related to
the Radial Basis Functions of [poggio89].^5 Some authors also designate "absorbing
states" which, when entered, cause the model to terminate production of a string.
The analysis of such absorbing models is somewhat different from that of the HMMs
in Definition 1.1 for uninteresting technical reasons. For the substantive problems of
pattern recognition an absorbing model can always be "simulated" in our formulation
by creating a state which emits a single special output symbol and loops with
probability 1 onto itself.

^4 This is analogous to the equivalence of Moore and Mealy Finite State Machines.

^5 Suppose $\mathcal{M}$ is a Hidden Markov Model with states $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ and Gaussian output
distributions $\{G_{s_1}, G_{s_2}, \ldots, G_{s_n}\}$ associated with the states. Also suppose $x = (o(1), o(2), \ldots, o(t))$ is an
output string of length $t$. Then we can use Equation 1.1 to write:

$$\Pr(x \mid \mathcal{M}, \vec{p}) = \sum_{s(1), \ldots, s(t)} \Pr(s(1), \ldots, s(t) \mid \mathcal{M}, \vec{p}) \, \Pr(x \mid s(1), \ldots, s(t))$$

$$= \sum_{s(1), \ldots, s(t)} \Pr(s(1), \ldots, s(t) \mid \mathcal{M}, \vec{p}) \, G_{s(1)}[o(1)] \cdots G_{s(t)}[o(t)]$$

Each of the products of Gaussians in the second equation defines a "center" for a Radial Basis
Function. The sum over states then evaluates a weighted sum over the activations of the various
"centers", which are produced as appropriate permutations of the $G_{s_i}$.

1.4.3 Induced Probability Distributions

As described in the definition of Initialized HMMs, a stochastic process can be
modelled using Hidden Markov techniques by constructing an appropriate HMM,
initializing it by specifying a prior probability distribution on the states, and then evolving
the model according to its parameters. This evolution then produces output strings
whose statistics define a stochastic process over the output set of the model. In
recognition applications we are usually interested in the probability that a given
observation string was produced by a source whose model is a given HMM. We quantify
this by defining the probability distribution over strings induced by an Initialized
Hidden Markov Model:

Definition 1.3 (Induced Probability Distributions)

Suppose we are given an HMM $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ and a prior distribution $\vec{p}$. Borrow
the standard notation of the theory of regular languages, and let $\mathcal{O}^*$ denote the set of
all finite length strings that can be formed by concatenating symbols in $\mathcal{O}$ together.
We then define the probability that a given string $x \in \mathcal{O}^*$ is produced by $\mathcal{M}(\vec{p})$ as
follows. Let $m = |x|$ and let $s_1, s_2, \ldots, s_m \in \mathcal{S}$. Then:

$$\Pr(x \mid \mathcal{M}, \vec{p}) = \Pr(x \mid \mathcal{M}(\vec{p}), |x|) = \sum_{s_1, \ldots, s_m} \Pr(s_1, \ldots, s_m \mid \mathcal{M}(\vec{p})) \, \Pr(x \mid s_1, \ldots, s_m) \qquad (1.1)$$

Essentially, given a model $\mathcal{M}$, the probability of a string $x$ of length $m$ is the likelihood
that the model will emit the string $x$ while traversing any length $m$ sequence of states.
Because the definition conditions the probability on the length of the string, $\Pr(x \mid \mathcal{M}, \vec{p})$
defines a probability distribution over strings $x$ of length $m$ for each positive integer
$m$. We let $\epsilon$ represent the null string and set $\Pr(\epsilon \mid \mathcal{M}, \vec{p}) = 1$.
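Equation 1.1 can be evaluated literally by enumerating every state sequence. The sketch below does exactly that, as an illustration rather than a practical method, since the number of sequences grows as $n^m$. The function name `string_probability` and the example model are assumptions introduced here, and the first state is assumed to be distributed according to the prior.

```python
from itertools import product


def string_probability(A, B, p, x):
    """Pr(x | M, p) from Equation 1.1, summing over all state sequences.

    A[i][j] = Pr(next state i | current state j)
    B[o][j] = Pr(output o | state j)
    x is a sequence of output indices.
    """
    n = len(A)
    if len(x) == 0:
        return 1.0  # convention for the null string
    total = 0.0
    for states in product(range(n), repeat=len(x)):
        # Pr(s_1, ..., s_m | M(p)): prior on s_1, then Markov transitions.
        path = p[states[0]]
        for prev, nxt in zip(states, states[1:]):
            path *= A[nxt][prev]
        # Pr(x | s_1, ..., s_m): each state emits its own symbol.
        emit = 1.0
        for s, o in zip(states, x):
            emit *= B[o][s]
        total += path * emit
    return total


# Hypothetical example model and prior.
A = [[0.9, 0.2],
     [0.1, 0.8]]
B = [[0.5, 0.1],
     [0.5, 0.9]]
p = [0.6, 0.4]
print(string_probability(A, B, p, [0]))
```

Summing `string_probability` over all strings of one fixed length gives 1, which reflects the remark that the definition yields a separate distribution for each string length.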

The probability distributions defined above specify the statistical properties of the
stochastic process for which the HMM initialized by $\vec{p}$ is a source model. Typical
pattern recognition applications evaluate this "posterior probability" of an observation
sequence given each of a collection of models and classify according to the model
with the highest likelihood.

So  an  HMM  defines  a  class  of  stochastic  processes  -  each  process  corresponding 
to  a  different  choice  of  initial  distribution  on  the  states.  This  immediately  raises  the 
question  of  testing  whether  two  prior  distributions  on  a  given  model  induce  identical 
processes.  In  Chapter  2  we  will  see  that  there  is  an  efficient  algorithm  for  deciding 
this  question.  But  first,  in  the  next  section,  we  will  introduce  some  notation  and 
techniques  that  show  how  to  use  the  basic  definitions  to  calculate  the  quantities  of 
interest  to  us. 


1.5 How To Calculate With HMMs

The basic quantity we are interested in calculating is the probability that a given string will
be produced by a given model. We will see later that for purposes of determining the
equivalence of models and reducing them to canonical forms it is also useful to compute
various probability distributions over the states and the outputs. In this section we
will introduce some notation that will enable us to mechanize the computation of these
quantities so that later analysis becomes easy. The notation and details may become
tedious and confusing, so the reader may wish to skim the section, referring back
to it as necessary.

Definition  1.4  (State  and  Output  Distributions) 

Let $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ be an HMM with a prior $\vec{p}$, $n$ states and $k$ outputs. Let $s(t)$
and $o(t)$ be, respectively, the state and output at time $t$. Let $\vec{k}(t)$ be an $n$-dimensional
column vector such that $k_i(t) = \Pr(s(t) = s_i, o(1), o(2), \ldots, o(t-1) \mid \mathcal{M}, \vec{p})$. In other
words, $k_i(t)$ is the joint probability of being in state $s_i$ at time $t$ and seeing all the previous
outputs. We define $m_i(t)$ to be the probability of being in state $s_i$ after also seeing
the output at time $t$: $m_i(t) = \Pr(s(t) = s_i, o(1), \ldots, o(t-1), o(t) \mid \mathcal{M}, \vec{p})$. Finally, let
$\vec{l}(t)$ be a column vector describing the probabilities of the various outputs at time $t$:
$l_i(t) = \Pr(o(t) = o_i, o(1), o(2), \ldots, o(t-1) \mid \mathcal{M}, \vec{p})$. From the definition of the $B$ matrix,
we can write this as: $\vec{l}(t) = B \vec{k}(t)$.

In order to determine equivalence of HMMs and reduce them to canonical forms
we will need to be able to reason conveniently about the temporal evolution of the
model. Using Definition 1.4 we can write that $\vec{k}(t+1) = A \vec{m}(t)$. Furthermore, if
$o(t) = o_j$ we can factor the definition of $\vec{m}(t)$ to write:

$$m_i(t) = \Pr(o(t) = o_j \mid s(t) = s_i, \mathcal{M}, \vec{p}, o(1), \ldots, o(t-1)) \, \Pr(s(t) = s_i, o(1), \ldots, o(t-1) \mid \mathcal{M}, \vec{p})$$

$$= \Pr(o(t) = o_j \mid s(t) = s_i) \, k_i(t)$$

$$= B_{ji} \, k_i(t) \qquad (1.2)$$

In order to write Equation 1.2 more compactly, we introduce the following notion of
a projection operator.

Definition  1.5  (Projection  Operators) 

Suppose an HMM $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ has $k$ outputs. We define a set of projection
operators $\{B_1, B_2, \ldots, B_k\}$ so that $B_i = \mathrm{Diag}[i\text{th row of } B]$. In other words $B_i$ is a
diagonal matrix whose diagonal elements are the row in $B$ corresponding to the output
symbol $o_i$. Sometimes we will use the notation $B_o$ to mean the projection operator
corresponding to the output $o$ (i.e. $B_o = \sum_i \delta(o_i, o) B_i$ where $\delta(a, b)$ is 1 if
$a = b$ and 0 otherwise). Suppose $\vec{v}$ is a vector whose dimension equals the number
of states of the model. Then multiplying $\vec{v}$ by $B_i$ weights each component of $\vec{v}$ by the
probability that the corresponding state would emit the output $o_i$.
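The projection operators are simple to construct explicitly. The following sketch builds them as dense diagonal matrices and applies one to a prior; the helper names `projection_operators` and `mat_vec` and the numerical values are assumptions made for this illustration.

```python
def projection_operators(B):
    """Return [B_1, ..., B_k] with B_i = Diag[i-th row of B] (Definition 1.5)."""
    n = len(B[0])
    return [[[row[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
            for row in B]


def mat_vec(M, v):
    """Multiply the matrix M by the column vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


# Hypothetical output matrix: B[o][j] = Pr(output o | state j).
B = [[0.5, 0.1],
     [0.5, 0.9]]
p = [0.6, 0.4]

ops = projection_operators(B)
weighted = mat_vec(ops[0], p)  # each component of p scaled by Pr(o_1 | state)
print(weighted)
```

Because each column of $B$ sums to one, the projection operators sum to the identity matrix, so splitting a vector across all outputs and adding the pieces back recovers the original vector.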

We can use the projection operator notation to compactly write Equation 1.2 as
$\vec{m}(t) = B_{o(t)} \vec{k}(t)$. Now we can write $\vec{k}(t+1) = A B_{o(t)} \vec{k}(t)$ and $\vec{m}(t+1) = B_{o(t+1)} A \vec{m}(t)$.
In order to summarize this we introduce a set of definitions for the transition operators
of a Hidden Markov Model.

Definition  1.6  (Transition  Operators) 

Given an HMM $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ with $n$ states we define the model transition
operators as follows. Let $\epsilon$ be the null string. Define $T(\epsilon) = I$ where $I$ is the $n \times n$
identity matrix. Also, for every $o_k \in \mathcal{O}$ define $T(o_k) = A B_{o_k}$. We can see that $T(o_k)_{ij}$
is the probability of emitting $o_k$ in state $s_j$ and then entering $s_i$. We extend these to
be transition operators on $\mathcal{O}^*$ as follows. For any output string $x = (o_1, o_2, \ldots, o_t) \in \mathcal{O}^*$
let:

$$T(x) = T(o_1, \ldots, o_t) = T(o_t) \, T(o_{t-1}) \cdots T(o_1) \qquad (1.3)$$

We can interpret these extended transition operators by noticing that $T(x)_{ij}$ is the
probability of starting in state $s_j$, emitting the string $x$, and then entering state $s_i$.
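The ordering in Equation 1.3 is easy to get wrong, since the operator for the first symbol must act first and therefore sits rightmost in the product. The sketch below builds $T(x)$ by left-multiplying one symbol at a time. The helper names and the example model are assumptions introduced for this illustration.

```python
def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(X, Y):
    """Matrix product X Y for nested-list matrices."""
    return [[sum(X[i][m] * Y[m][j] for m in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transition_operator(A, B, x):
    """T(x) = T(o_t) ... T(o_1) with T(o) = A B_o and T(empty string) = I."""
    n = len(A)
    T = identity(n)
    for o in x:
        # (A B_o)[i][j] = A[i][j] * B[o][j]: emit o in state j, then enter state i.
        T_o = [[A[i][j] * B[o][j] for j in range(n)] for i in range(n)]
        T = mat_mul(T_o, T)  # later symbols multiply on the left
    return T


# Hypothetical example model.
A = [[0.9, 0.2],
     [0.1, 0.8]]
B = [[0.5, 0.1],
     [0.5, 0.9]]
T0 = transition_operator(A, B, [0])
T01 = transition_operator(A, B, [0, 1])
print(T0)
```

Here `T01` equals `mat_mul(transition_operator(A, B, [1]), T0)`, which is the two-symbol case of Equation 1.3.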

Using the transition operators of Definition 1.6 we can conveniently write all the
quantities we wish to compute. Suppose $\mathcal{M}$ is an HMM with $n$ states, $k$ outputs and prior
$\vec{p}$. Take $x_t$ to be the output string $(o_1, o_2, \ldots, o_t)$ and $\vec{1}$ to be an $n$-dimensional vector
all of whose entries are 1. Also let $x_{t-1}$ be the $(t-1)$-long prefix of $x_t$. Then we can
see that:

$$\vec{m}(t) = B_{o_t} T(x_{t-1}) \vec{p} \qquad (1.4)$$

$$\vec{k}(t+1) = A \vec{m}(t) = T(x_t) \vec{p} \qquad (1.5)$$

$$\vec{l}(t+1) = B \vec{k}(t+1) = B \, T(x_t) \vec{p} \qquad (1.6)$$

$$\Pr(x_t \mid \mathcal{M}, \vec{p}) = \vec{1} \cdot (T(x_t) \vec{p}) \qquad (1.7)$$

The  reader  may  wish  to  verify  some  of  these  equations  from  the  definitions  to  ensure 
his  or  her  facility  with  the  notation. 
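One way to carry out such a verification numerically is to run the recursion step by step. The sketch below alternates the projection and transition steps, so that it never forms $T(x_t)$ explicitly, and it reads the string probability off the final vector as in Equation 1.7. The function names and the example model are assumptions introduced here, with $\vec{k}(1) = T(\epsilon)\vec{p} = \vec{p}$ as the starting point.

```python
def forward(A, B, p, x):
    """Return (k, prob) where k = k(t+1) = T(x_t) p and prob = 1 . k(t+1)."""
    n = len(A)
    k = list(p)  # k(1) = T(empty string) p = p
    for o in x:
        m = [B[o][j] * k[j] for j in range(n)]              # m(t) = B_o(t) k(t)
        k = [sum(A[i][j] * m[j] for j in range(n))          # k(t+1) = A m(t)
             for i in range(n)]
    return k, sum(k)


def next_output_distribution(B, k):
    """l(t+1) = B k(t+1): joint probability of each next output with the past."""
    return [sum(row[j] * k[j] for j in range(len(k))) for row in B]


# Hypothetical example model and prior.
A = [[0.9, 0.2],
     [0.1, 0.8]]
B = [[0.5, 0.1],
     [0.5, 0.9]]
p = [0.6, 0.4]

k2, prob = forward(A, B, p, [0])
l2 = next_output_distribution(B, k2)
print(k2, prob, l2)
```

Each symbol costs on the order of $n^2$ operations, so a string of length $t$ costs on the order of $t n^2$, in contrast with the $n^t$ terms of the direct sum in Equation 1.1.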

1.6  Roadmap 

This chapter has developed the background necessary for understanding the results
in this thesis. The basic definitions and notation given here are summarized in
Table 1.1. Chapter 2 discusses the algorithms related to equivalence of Hidden Markov
Models. Chapter 3 defines Generalized Markov Models and describes the algorithm
for reducing HMMs to minimal canonical forms. Chapter 3 also contains a
fundamental characterization of the essential expressiveness of a Hidden Markov Model.
Chapter 4 presents some preliminary ideas concerning several topics including
approximate equivalence and potential practical applications of the results of this thesis.
Finally, Appendix A shows how HMMs, in the formulation of this paper, are related
to Probabilistic Automata.
Given: an HMM $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ with $n$ states, $k$ outputs and prior $\vec{p}$

Definitions:

1. $\Pr(x \mid \mathcal{M}, \vec{p}) = \Pr(x \mid \mathcal{M}(\vec{p}), |x|) = \sum_{s_1, \ldots, s_m} \Pr(s_1, \ldots, s_m \mid \mathcal{M}(\vec{p})) \, \Pr(x \mid s_1, \ldots, s_m)$

2. $\Pr(\epsilon \mid \mathcal{M}, \vec{p}) = 1$ where $\epsilon$ is the null string

3. $\vec{k}(t)$ is an $n$-dimensional vector such that
$k_i(t) = \Pr(s(t) = s_i, o(1), o(2), \ldots, o(t-1) \mid \mathcal{M}, \vec{p})$

4. $\vec{m}(t)$ is an $n$-dimensional vector such that
$m_i(t) = \Pr(s(t) = s_i, o(1), \ldots, o(t-1), o(t) \mid \mathcal{M}, \vec{p})$

5. $\vec{l}(t)$ is a $k$-dimensional vector such that
$l_i(t) = \Pr(o(t) = o_i, o(1), o(2), \ldots, o(t-1) \mid \mathcal{M}, \vec{p})$

6. The projection operators $\{B_1, B_2, \ldots, B_k\}$
are defined as $B_i = \mathrm{Diag}[i\text{th row of } B]$. Also
if $o \in \mathcal{O}$ then we write $B_o$ to denote the
projection operator corresponding to output $o$.

7. We define transition operators so that:

$T(\epsilon) = I$

$T(o_k) = A B_{o_k}$

$T(o(1), o(2), \ldots, o(t)) = T(o(t)) \cdots T(o(2)) \, T(o(1))$

Model Evolution:

1. Suppose the HMM emits the output $x_t = [o(1), o(2), \ldots, o(t)]$.
Also use the notation $x_{t-1}$ to mean the $(t-1)$-long prefix of $x_t$, and the
symbol $\vec{1}$ to mean the $n$-dimensional vector all of
whose entries are 1. Then we can write:

• $\vec{m}(t) = B_{o(t)} T(x_{t-1}) \vec{p}$

• $\vec{k}(t+1) = A \vec{m}(t) = T(x_t) \vec{p}$

• $\vec{l}(t+1) = B \vec{k}(t+1) = B \, T(x_t) \vec{p}$

• $\Pr(x_t \mid \mathcal{M}, \vec{p}) = \vec{1} \cdot (T(x_t) \vec{p})$


Table 1.1: Summary of Important Notations
Chapter 2


Equivalence  of  HMMs 


As discussed in the previous chapter, many different Hidden Markov Models can
represent the same stochastic process. Prior to addressing questions about the expressive
power of HMMs, it is important to understand exactly when two models $\mathcal{M}$ and $\mathcal{N}$ are
equivalent in the sense that they represent the same statistics. In Section 2.2 we will
see how to determine when two prior distributions on a given HMM induce identical
stochastic processes. Section 2.3 discusses equivalence of Initialized Hidden Markov
Models. Section 2.4 shows how to determine whether two HMMs are representations
for the same class of stochastic processes. This will lead, in the next chapter, to
a fundamental characterization of the degree of freedom available in a given model.
This characterization will be used to reduce HMMs to minimal canonical forms.

2.1  Definitions 

We  begin  by  defining  what  we  mean  by  equivalence  of  Hidden  Markov  Models.  First 
of  all,  we  should  say  what  it  means  for  two  stochastic  processes  to  be  equivalent. 

Definition  2.1  (Equivalence  of  Stochastic  Processes) 

Suppose $\mathcal{X}$ and $\mathcal{Y}$ are two stochastic processes on the same discrete alphabet $\mathcal{O}$. For
each $x \in \mathcal{O}^*$ let $\Pr_{\mathcal{X}}(x)$ be the probability that after $|x|$ steps the process $\mathcal{X}$ has emitted
the string $x$. Define $\Pr_{\mathcal{Y}}(x)$ similarly. Then we say that $\mathcal{X}$ and $\mathcal{Y}$ are equivalent
processes ($\mathcal{X} \Leftrightarrow \mathcal{Y}$) if and only if $\Pr_{\mathcal{X}}(x) = \Pr_{\mathcal{Y}}(x)$ for every $x \in \mathcal{O}^*$.

In Chapter 1 we discussed the interpretation of an Initialized Hidden Markov Model
(IHMM) as a finite-state representation for a stochastic process, and we defined the
probability distribution over strings induced by the process. We can use these
definitions to say what we mean by equivalence of Initialized HMMs.

Definition  2.2  (Equivalence  of  Initialized  HMMs) 

Let $\mathcal{M}$ and $\mathcal{N}$ be two Hidden Markov Models with the same output
set, initialized by priors $p$ and $q$ respectively. We wish to say that these
initialized models are equivalent if they represent the same stochastic process. So
we say that $\mathcal{M}(p)$ is equivalent to $\mathcal{N}(q)$
($\mathcal{M}(p) \Leftrightarrow \mathcal{N}(q)$) if and only if
$\Pr(x \mid \mathcal{M}, p) = \Pr(x \mid \mathcal{N}, q)$ for every
$x \in \mathcal{O}^*$.

This is the same as saying that $\mathcal{M}(p) \Leftrightarrow \mathcal{N}(q)$
exactly when, for every time $t$, the joint probability of the output with the entire
previous output sequence is the same for both models. In the notation of Chapter 1
we can write this as: $B_{\mathcal{M}} T_{\mathcal{M}}(x)\,p = B_{\mathcal{N}} T_{\mathcal{N}}(x)\,q$
for every $x \in \mathcal{O}^* \cup \{\epsilon\}$.
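The condition in Definition 2.2 can be explored numerically. The following is a minimal Python sketch, not part of the original development, that compares $B\,T(x)\,p$ for two initialized models over all strings up to a small length. It assumes the Chapter 1 conventions that `A` is column-stochastic, that `B[o][j]` is the probability of output `o` in state `j`, and that $T(o) = A\,\mathrm{diag}(B[o])$; the function names and the example models are hypothetical. Because only finitely many strings are examined, such a check can refute equivalence but cannot establish it.

```python
from itertools import product

def mat_vec(M, v):
    """Multiply matrix M (a list of rows) by the column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def apply_T(A, B, o, v):
    """Apply T(o) = A * diag(B[o]) to v (assumed Chapter 1 convention)."""
    weighted = [B[o][j] * v[j] for j in range(len(v))]
    return mat_vec(A, weighted)

def joint_output(A, B, x, p):
    """Return B T(x) p, where T(x) = T(o_t) ... T(o_1)."""
    v = list(p)
    for o in x:                      # the first symbol of x is applied first
        v = apply_T(A, B, o, v)
    return mat_vec(B, v)

def agree_up_to(model_m, p, model_n, q, max_len, tol=1e-12):
    """Compare B_M T_M(x) p with B_N T_N(x) q for every x with |x| <= max_len."""
    (A_m, B_m), (A_n, B_n) = model_m, model_n
    k = len(B_m)                     # number of output symbols
    for length in range(max_len + 1):
        for x in product(range(k), repeat=length):
            lhs = joint_output(A_m, B_m, x, p)
            rhs = joint_output(A_n, B_n, x, q)
            if any(abs(a - b) > tol for a, b in zip(lhs, rhs)):
                return False
    return True

# Hypothetical example: a two-state model whose states share one output
# distribution, and a one-state model with that same distribution.
model_m = ([[0.5, 0.25], [0.5, 0.75]], [[0.3, 0.3], [0.7, 0.7]])
model_n = ([[1.0]], [[0.3], [0.7]])
model_n_biased = ([[1.0]], [[0.5], [0.5]])
p, q = [1.0, 0.0], [1.0]
print(agree_up_to(model_m, p, model_n, q, 4))
print(agree_up_to(model_m, p, model_n_biased, q, 4))
```

The exponential number of strings makes this approach impractical beyond very short lengths, which is the motivation for the vector-space arguments in the rest of the chapter.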

In  Chapter  1  we  also  mentioned  that  different  prior  distributions  on  the  same  HMM 
could  induce  the  same  stochastic  process.  In  order  to  identify  the  conditions  under 
which  this  can  occur  we  make  the  following  definition. 

Definition  2.3  (Equivalence  of  Prior  Distributions) 

Let $p$ and $q$ be two different prior distributions on an HMM
$\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$. We say that $p$ and $q$ are
equivalent priors for $\mathcal{M}$ ($p \equiv q$) if and only if
$\mathcal{M}(p) \Leftrightarrow \mathcal{M}(q)$, i.e., if and only if the
Initialized HMMs derived by fixing the priors on $\mathcal{M}$ are equivalent.

We are now ready to define equivalence of Hidden Markov Models. As discussed in
Chapter 1, HMMs can be treated as finite state representations for classes of stochastic
processes.  We  would  like  to  say  that  two  HMMs  are  equivalent  if  they  represent  the 
same  class  of  processes. 




Definition  2.4  (Equivalence  and  Subset  Relations  for  HMMs) 

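Let $\mathcal{M}$ and $\mathcal{N}$ be two HMMs with the same output set. Let $p$ and
$q$ denote prior distributions on $\mathcal{M}$ and $\mathcal{N}$ respectively. We say
that $\mathcal{N}$ is a subset of $\mathcal{M}$ ($\mathcal{N} \subseteq \mathcal{M}$)
if and only if for each $q$ on $\mathcal{N}$ we can find a corresponding $p$ on
$\mathcal{M}$ such that $\mathcal{M}(p) \Leftrightarrow \mathcal{N}(q)$. In other
words, $\mathcal{N}$ is a subset of $\mathcal{M}$ if and only if the class of
processes represented by $\mathcal{N}$ is a subset of the class of processes
represented by $\mathcal{M}$. We can then write $\mathcal{M}$ is equivalent to
$\mathcal{N}$ ($\mathcal{M} \Leftrightarrow \mathcal{N}$) exactly when
$\mathcal{N} \subseteq \mathcal{M}$ and $\mathcal{M} \subseteq \mathcal{N}$.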

The  basic  intuition  underlying  all  the  results  concerning  the  equivalence  of  HMMs 
is  the  following:  The  output  distributions  of  an  HMM  are  linear  transformations 
that  map  an  underlying  dynamics  on  the  states  onto  a  dynamics  on  the  space  of 
observations.  Heuristically,  it  must  be  the  case  that  the  components  of  the  dynamics 
on  the  states  that  fall  in  the  null-space  of  the  output  matrix  must  represent  degrees  of 
freedom  that  are  irrelevant  to  the  statistics  on  the  outputs.  So,  for  example,  we  will 
see  that  two  prior  distributions  on  a  model  are  equivalent  if  and  only  if  their  difference 
falls in a particular subspace of the null-space of the output matrix. All the algorithms
discussed  in  this  chapter  will  achieve  their  goals  by  rapidly  checking  properties  of 
various  vector  spaces  associated  with  HMMs. 


2.2  Equivalence  of  Priors 

When  do  two  prior  distributions  on  a  given  model  induce  the  same  stochastic  process? 
This  is  the  most  basic  question  that  we  would  like  to  answer.  Using  the  notation 
developed  in  Chapter  1,  and  the  definition  of  equivalent  Initialized  HMMs,  we  can 
write the condition for equivalent priors as follows: $p \equiv q$ if and only if
$B\,T(x)\,p = B\,T(x)\,q$ for every $x \in \mathcal{O}^* \cup \{\epsilon\}$. Let
$\delta = p - q$. Then we can rephrase this as:
$B\,T(x)\,[p - q] = B\,T(x)\,\delta = 0$ for every
$x \in \mathcal{O}^* \cup \{\epsilon\}$. In other words $p \equiv q$ if and only if
for every string $x \in \mathcal{O}^* \cup \{\epsilon\}$ we can say that
$T(x)\,\delta$ is a vector that falls in the null-space of the output matrix $B$.
This can be expressed in more geometrical terms as follows.

Theorem 2.1 Equivalence of Priors (Geometrical Interpretation)^

Suppose $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ is a Hidden Markov Model
with $n$ states and $k$ outputs. Let $p$ and $q$ be two prior distributions on
$\mathcal{M}$ with $\delta = p - q$. Let $\mathcal{N}$ denote the null-space of the
linear transformation $B$ and let $\mathcal{I}$ be the largest subspace of
$\mathcal{N}$ that is invariant under each of the transition operators $T(o_i)$.
Then $p \equiv q$ if and only if $\delta \in \mathcal{I}$.

Proof: First of all suppose $\delta \in \mathcal{I} \subseteq \mathcal{N}$. Then
because $\mathcal{I}$ is invariant under all the $T(o_i)$ we know that
$T(o_i)\,\delta \in \mathcal{I}$ and, by induction, we can say that for every
$x = [o(1), o(2), \ldots, o(t)] \in \mathcal{O}^*$ it is true that
$T(x)\,\delta = T(o_t) \cdots T(o_1)\,\delta \in \mathcal{I}$. We conclude that
$T(x)\,\delta \in \mathcal{N}$ for every $x \in \mathcal{O}^* \cup \{\epsilon\}$.
Therefore, by our earlier discussion, $p$ is equivalent to $q$. This proves the
sufficiency of our condition for equivalence. Next we prove necessity. Suppose that
$p \equiv q$. Then let
$\mathcal{D} = \{\delta(x) : \delta(x) = T(x)\,\delta,\ x \in \mathcal{O}^* \cup \{\epsilon\}\}$
be the set of all differences between $T(x)\,p$ and $T(x)\,q$ for every string $x$.
If $\delta(x)$ is any vector in $\mathcal{D}$ and $T(o_i)$ is any transition
operator, then $T(o_i)\,\delta(x)$ is also in $\mathcal{D}$. So $\mathcal{D}$ is
invariant under the action of every transition operator and, therefore, so is
$\mathrm{Span}(\mathcal{D})$. By assumption of equivalence of priors, every vector in
$\mathcal{D}$ lies in the null-space of $B$. So
$\mathrm{Span}(\mathcal{D}) \subseteq \mathcal{N}$. We conclude that
$\mathrm{Span}(\mathcal{D})$ is a subspace of $\mathcal{I}$, the largest subspace of
$\mathcal{N}$ that is invariant under all the transition operators. Since
$\delta = \delta(\epsilon) \in \mathcal{D}$, it follows that
$\delta \in \mathcal{I}$. This proves the necessity of our condition for
equivalence. □

In  effect,  the  difference  between  equivalent  priors  is  a  vector  that  lies  in  a  subspace 
that  contributes  nothing  to  the  probability  distribution  over  outputs,  and  remains  in 
this subspace as the model evolves. It is not enough that $\delta$ simply be in the null-space

^We remind the reader of the following linear algebraic notions. The null-space of a
linear transformation $B$ from $\mathbb{R}^n$ to $\mathbb{R}^k$ is the subspace of
$\mathbb{R}^n$ that is mapped by $B$ into the $k$-dimensional zero vector. An
invariant subspace of a linear transformation $T$ from $\mathbb{R}^n$ to
$\mathbb{R}^n$ is a subspace $V$ such that $T$ maps every vector in $V$ into $V$.
of $B$ because some of the vectors in the null-space may contribute to the dynamics
of the system and affect later distributions over outputs. The fact that $\delta$
lies in an invariant subspace of the null-space guarantees that $\delta$ will never
contribute to the distribution over outputs, even after the model evolves. Figure 2.1
shows a simple example in which all the states have the same output distribution, so
that the null-space of $B$ consists of all vectors that sum to zero. Furthermore, for
every $i$, $B_i$ is proportional to the identity so that $T(o_i)$ is proportional to
$A$. Since $A$ is stochastic it preserves sums and so we see that the space of
vectors which sum to zero is an invariant subspace of every $T(o_i)$. For any priors
$p$ and $q$ we know that $\delta = p - q$ sums to zero since $p$ and $q$ are both
stochastic. So, as we would expect for this degenerate case, Theorem 2.1 tells us
that all prior distributions on the model induce equivalent stochastic processes.
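The degenerate example can be checked in a few lines. The derivation below is a sketch that assumes the Chapter 1 conventions $T(o_i) = A B_i$, with $B_i$ the diagonal matrix of the probabilities of output $o_i$ in each state, and a column-stochastic $A$, so that $\mathbf{1}^{\top} A = \mathbf{1}^{\top}$. The vector $b$ of shared output probabilities is notation introduced here for illustration only.

```latex
% Every state shares the output distribution b = (b_1, ..., b_k)^T.
\begin{align*}
B &= b\,\mathbf{1}^{\top}, \qquad B_i = b_i I, \qquad T(o_i) = A B_i = b_i A, \\
B\,\delta &= b\,(\mathbf{1}^{\top}\delta) = 0
  \quad \text{whenever } \mathbf{1}^{\top}\delta = 0, \\
\mathbf{1}^{\top} T(o_i)\,\delta &= b_i\,\mathbf{1}^{\top} A\,\delta
  = b_i\,\mathbf{1}^{\top}\delta = 0 .
\end{align*}
```

So the space of vectors that sum to zero lies in the null-space of $B$ and is mapped into itself by every $T(o_i)$, which is the condition Theorem 2.1 asks of $\delta$.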

Although  Theorem  2.1  gives  a  good  understanding  of  why  two  priors  may  be 
equivalent  for  a  model,  it  is  not  in  a  form  that  is  immediately  useful  for  developing  a 
quick  algorithm.  So  we  prove  another  form  of  the  theorem  that  will  be  used  directly 
in the algorithm of Figure 2.2.

Theorem  2.2  Equivalence  of  Priors 

Let $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ be a Hidden Markov Model.
Suppose $p$ and $q$ are two prior distributions on $\mathcal{M}$ with
$\delta = p - q$. Define
$\mathcal{D} = \{\delta(x) : \delta(x) = T(x)\,\delta,\ x \in \mathcal{O}^* \cup \{\epsilon\}\}$,
and let $V$ be any collection of vectors in $\mathcal{D}$ that forms a basis for the
vector space spanned by the elements of $\mathcal{D}$. Then $p \equiv q$ if and only
if every vector in $V$ lies in the null-space of $B$.

Proof: First suppose that $p \equiv q$. Then $V \subseteq \mathcal{D}$ and so, from
the previous discussion, every vector in $V$ must fall in the null-space of $B$,
proving the necessity of the theorem. Now suppose that $B v_j = 0$ for every vector
$v_j \in V$. Then, since $V$ is a basis for the span of $\mathcal{D}$, for every
$\delta_i \in \mathcal{D}$ there exists a collection of coefficients $\{c_{ij}\}$
such that $\delta_i = \sum_j c_{ij} v_j$. So, for every $\delta_i$ we can write
$B\,\delta_i = B \sum_j c_{ij} v_j = \sum_j c_{ij}\,(B v_j) = 0$. This is the same as
saying that $B\,T(x)\,\delta = 0$ for every $x \in \mathcal{O}^* \cup \{\epsilon\}$.
Consequently, we have the desired result that $p \equiv q$. □


Figure 2-1: Geometrical Interpretation of Equivalence of Priors


Theorem  2.2  provides  a  necessary  and  sufficient  condition  for  equivalence  of  priors 
on  a  Hidden  Markov  Model.  We  can  use  it  to  construct  an  algorithm  by  quickly 
generating  the  basis  V  of  the  theorem  and  checking  that  the  elements  of  the  basis  fall 
in  the  null-space  of  B.  The  algorithm  in  Figure  2.2  does  exactly  this.^  We  will  now 

^Our  procedure  for  checking  equivalence  of  priors  can  be  optimized  in  various  ways.  One  such 
optimization  will  be  presented  in  the  analysis  of  the  running  time  of  the  algorithm.  We  present  the 
algorithm  of  Figure  2.2  because  it  is  easier  to  explain. 
argue that the algorithm is correct and proceed to calculate its running time.


Given: An HMM $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$
       where $|\mathcal{S}| = n$, $|\mathcal{O}| = k$
       And priors $p$ and $q$ on $\mathcal{M}$, with $\delta = p - q$

 1. V = { }
 2. Queue = { $\delta$ }

    % Step 1: Find a Basis
 3. Until (|Queue| = 0) or (|V| = n) do
 4.     Let $f$ = first element in Queue
 5.     Remove $f$ from Queue
 6.     If $f \notin \mathrm{Span}(V)$ Then
 7.         Add $f$ to V
 8.         For each $o_i \in \mathcal{O}$ do
 9.             Add $T(o_i) f$ to Queue

    % Step 2: Test the basis
10. For each $v \in V$ do
11.     If $Bv \neq 0$ Then Return(NOT-EQUIVALENT)
12. Return(EQUIVALENT)


Figure  2-2:  Algorithm  for  Detecting  Equivalence  of  Priors 
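The procedure of Figure 2.2 can be sketched in Python as follows. This is an illustrative rendering under stated assumptions, not the author's code: it assumes the Chapter 1 conventions that `A` is column-stochastic, `B[o][j]` is the probability of output `o` in state `j`, and $T(o) = A\,\mathrm{diag}(B[o])$. The span test of line 6 is done by incremental elimination over exact `Fraction` values, and the helper names are hypothetical.

```python
from collections import deque
from fractions import Fraction as F

def mat_vec(M, v):
    """Multiply matrix M (a list of rows) by the column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transition_operators(A, B):
    """Build T(o) = A * diag(B[o]) for every output o (assumed convention)."""
    n = len(A)
    return [[[A[i][j] * b[j] for j in range(n)] for i in range(n)] for b in B]

def reduce_against(echelon, v):
    """Eliminate from v its components along the stored echelon rows."""
    r = list(v)
    for pivot, row in echelon:
        if r[pivot] != 0:
            c = r[pivot]
            r = [a - c * b for a, b in zip(r, row)]
    return r

def equivalent_priors(A, B, p, q):
    """Return True when p and q are equivalent priors (entries must be Fractions)."""
    n = len(A)
    T = transition_operators(A, B)
    delta = [a - b for a, b in zip(p, q)]
    V, echelon = [], []                       # line 1 (plus elimination state)
    queue = deque([delta])                    # line 2
    while queue and len(V) < n:               # line 3
        f = queue.popleft()                   # lines 4-5
        r = reduce_against(echelon, f)
        if any(x != 0 for x in r):            # line 6: f is not in Span(V)
            V.append(f)                       # line 7
            pivot = next(i for i, x in enumerate(r) if x != 0)
            echelon.append((pivot, [x / r[pivot] for x in r]))
            for T_o in T:                     # lines 8-9
                queue.append(mat_vec(T_o, f))
    # lines 10-12: every basis vector must lie in the null-space of B
    return all(all(y == 0 for y in mat_vec(B, v)) for v in V)

# Hypothetical examples.
A2 = [[F(1, 2), F(1, 4)], [F(1, 2), F(3, 4)]]
B_same = [[F(1, 3), F(1, 3)], [F(2, 3), F(2, 3)]]   # identical output distributions
B_diff = [[F(1), F(0)], [F(0), F(1)]]               # each state has its own output
A3 = [[F(0), F(0), F(0)], [F(0), F(1), F(0)], [F(1), F(0), F(1)]]
B3 = [[F(1, 2), F(1, 2), F(1)], [F(1, 2), F(1, 2), F(0)]]
print(equivalent_priors(A2, B_same, [F(1), F(0)], [F(0), F(1)]))
print(equivalent_priors(A2, B_diff, [F(1), F(0)], [F(0), F(1)]))
print(equivalent_priors(A3, B3, [F(1), F(0), F(0)], [F(0), F(1), F(0)]))
```

In the third example $\delta = (1, -1, 0)$ lies in the null-space of $B$, but its child $T(o_1)\,\delta$ does not, so the priors are not equivalent. This is the situation the invariance requirement of Theorem 2.1 is meant to exclude.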


Correctness: The algorithm of Figure 2.2 proceeds in two steps. In Step 1 it finds a
basis $V$ and, in Step 2, it checks the necessary and sufficient condition for
equivalence given in Theorem 2.2. So, it checks equivalence of priors correctly if
$V$ is indeed a basis for the span of
$\mathcal{D} = \{\delta(x) : \delta(x) = T(x)\,\delta,\ x \in \mathcal{O}^* \cup \{\epsilon\}\}$.
In order to analyze the algorithm we will use the terminology that the vector
$T(o_i)\,v$ is a child of the vector $v$. When the basis finding step of the
algorithm terminates, $V$ contains a linearly independent collection of vectors. If
the step terminated because $|V| = n$, we must have a basis for
$\mathrm{Span}(\mathcal{D})$ since the vectors in $\mathcal{D}$ are $n$-dimensional.
Suppose now that the basis finding step terminated because Queue was empty. Each
child of each of the vectors in $V$ was added to Queue by line 9. So each of these
children is either in $V$ or was found to be a linear combination of a set of vectors
in $V$. Let $\mathcal{C}$ denote the set of children of elements of $V$ that are not
themselves in $V$. Then we can write $c_i = \sum_{v_j \in V} d_{ij} v_j$ for every
$c_i \in \mathcal{C}$. Suppose now $v \in \mathcal{D}$ is not in $V$ and is not a
child of a vector in $V$. By construction of the algorithm we can find some string
$x$ and some $c_i$ which is a child of a vector in $V$ such that $T(x)\,c_i = v$. We
wish to show that every such $v$ is in the span of $V$. We will do this by induction
on the length of the string $x$. If $|x| = 1$ so that $x = o_k \in \mathcal{O}$,
then for some $c_i$ we know that
$v = T(o_k)\,c_i = T(o_k) \sum_{v_j \in V} d_{ij} v_j = \sum_{v_j \in V} d_{ij}\,T(o_k)\,v_j$.

So we see that $v$ is a linear combination of children of elements of $V$, which all
necessarily fall in the span of $V$. Hence $v$ falls in the span of $V$ if
$v = T(x)\,c_i$ for any $x$ of length one and any $c_i \in \mathcal{C}$. Now assume
that for every $x$ such that $|x| \le t$ we know that $v = T(x)\,c_i$ is in the span
of $V$. So we write that $v = \sum_{v_j \in V} d_{vj} v_j$.

Then for every string $y = x o_k$ of length $t + 1$ we know that there is a $c_i$
such that
$u = T(y)\,c_i = T(o_k)\,T(x)\,c_i = T(o_k)\,v = T(o_k) \sum_{v_j \in V} d_{vj} v_j$.
Taking the multiplication by $T(o_k)$ into the sum we see that $u$ is a linear
combination of vectors in $V$ and their children, all of which fall in
$\mathrm{Span}(V)$. So $u \in \mathrm{Span}(V)$ also. By induction on $t = |x|$, all
$u \in \mathcal{D}$ are in the span of $V$. Therefore, as claimed, $V$ is a basis for
the span of the vectors in $\mathcal{D}$. The second step of the algorithm then
evaluates the necessary and sufficient condition of Theorem 2.2 on the basis
generated in the first step. Therefore, our algorithm is correct. □

Running Time: We will now compute the worst case running time of the equivalent
priors algorithm assuming unit cost arithmetic operations. Once the basis $V$ is
generated in Step 1, the check performed in Step 2 takes $O(n^2 k)$ time since
$|V| \le n$ and each multiplication by $B$ takes time $O(nk)$. In addition, it takes
$O(n^2 k)$ time to generate all the $T(o_i)$ matrices used in the algorithm from the
given $A$ and $B$ matrices. To analyze Step 1, we observe that each multiplication of
$f$ by $T(o_i)$ in line 9 takes time $O(n^2)$. In the worst case the basis $V$
generated in Step 1 will contain $n$ elements. For every $v \in V$ and every
$o_i \in \mathcal{O}$, line 9 adds all vectors $T(o_i)\,v$ to Queue. So, in all, time
$O(n^2 \cdot nk) = O(n^3 k)$ could be spent extending Queue. The final contribution
to the running time is from the check in line 6 of the algorithm to see if $f$ should
be added to the partially generated basis. We observe that
$f \notin \mathrm{Span}(V)$ can be tested in time $O(n|V|^2 + |V|^3)$ by standard
Gaussian elimination [press90]. In the worst case, the first $n - 1$ vectors that are
tested in line 6 will be added to the basis, and all the remaining $nk - (n - 1)$
vectors in Queue will have to be tested to find the last basis vector. So, for large
$k$ and $n$, these tests will take time $O(n^3) \cdot O(nk) = O(n^4 k)$. This gives
an $O(n^4 k)$ running time for the algorithm. We can do better by being a little more
clever about the test in line 6. An optimized algorithm would maintain, in addition
to the basis set $V$, a set $U$ of orthonormal basis vectors produced by applying the
Gram-Schmidt procedure to $V$. Every time a vector $f$ is extracted from Queue, it is
orthogonalized against the current set $U$. If the residue of this procedure is the
zero vector, $f$ is in $\mathrm{Span}(U) = \mathrm{Span}(V)$, and so $f$ is thrown
away.^ If the residue is non-zero, $f$ is added to $V$ and the residue is added to
$U$. The Gram-Schmidt procedure would take time $O(n|V|)$ since it just involves
projection of $f$ onto each of the vectors in $U$ and $|U| = |V|$. Repeating the
earlier analysis gives a worst case running time of $O(n^3 k)$ for this optimized
algorithm.
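The Gram-Schmidt variant of the line 6 test might look like the following Python sketch. It is an illustration only: the analysis above assumes exact unit-cost arithmetic, whereas this version uses floating point, so the tolerance `tol` and the normalization of the residue before it is stored in `U` are assumptions made here, and the function names are hypothetical.

```python
import math

def residue(U, f):
    """Remove from f its components along the orthonormal vectors in U.

    Each projection costs O(n), so the whole pass costs O(n |U|).
    """
    r = list(f)
    for u in U:
        c = sum(a * b for a, b in zip(r, u))
        r = [a - c * b for a, b in zip(r, u)]
    return r

def try_extend(V, U, f, tol=1e-10):
    """If f lies outside Span(V), add f to V and its normalized residue to U."""
    r = residue(U, f)
    norm = math.sqrt(sum(a * a for a in r))
    if norm <= tol:
        return False              # f is numerically in Span(U) = Span(V)
    V.append(list(f))
    U.append([a / norm for a in r])
    return True

# Hypothetical usage: two independent vectors are kept, two dependent ones dropped.
V, U = [], []
kept = [try_extend(V, U, f) for f in ([1.0, -1.0, 0.0], [0.5, -0.5, 0.0],
                                      [0.0, -0.5, 0.5], [1.0, -2.0, 1.0])]
print(kept, len(V))
```

Replacing the elimination test in Step 1 with `try_extend` leaves the rest of the procedure unchanged. The choice of tolerance matters in practice and is not addressed by the running time analysis.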

The  next  section  uses  this  result  concerning  equivalence  of  priors  to  develop  an 
algorithm  to  test  equivalence  of  Initialized  Hidden  Markov  Models. 

2.3  Equivalence  of  Initialized  HMMs 

In  order  to  develop  an  algorithm  to  check  equivalence  of  Initialized  Hidden  Markov 
Models  we  will  utilize  a  popular  trick  from  the  theory  of  Finite  Automata.  Given  two 
models  we  will  build  a  new  HMM  whose  properties  will  enable  us  to  check  equivalence 

^We  are  using  the  term  “residue”  to  mean  the  piece  of  a  vector  that  is  left  after  removing  all 
components  along  vectors  in  a  given  set. 
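of the two given models easily. (See Figure 2.3.) Suppose
$\mathcal{M} = (\mathcal{S}_{\mathcal{M}}, \mathcal{O}, A_{\mathcal{M}}, B_{\mathcal{M}})$
and
$\mathcal{N} = (\mathcal{S}_{\mathcal{N}}, \mathcal{O}, A_{\mathcal{N}}, B_{\mathcal{N}})$
are two HMMs initialized by priors $p$ and $q$ respectively. Then we construct a new
HMM
$\mathcal{Q} = (\mathcal{S}_{\mathcal{Q}}, \mathcal{O}_{\mathcal{Q}}, A_{\mathcal{Q}}, B_{\mathcal{Q}})$
where $\mathcal{S}_{\mathcal{Q}} = \mathcal{S}_{\mathcal{M}} \cup \mathcal{S}_{\mathcal{N}}$
and $\mathcal{O}_{\mathcal{Q}} = \mathcal{O}$. If $\mathcal{M}$ has $m$ states,
$\mathcal{N}$ has $n$ states and $|\mathcal{O}| = k$ we define:

$$A_{\mathcal{Q}} = \begin{bmatrix} A_{\mathcal{M}} & 0_{m \times n} \\ 0_{n \times m} & A_{\mathcal{N}} \end{bmatrix} \qquad (2.1)$$

$$B_{\mathcal{Q}} = \begin{bmatrix} B_{\mathcal{M}} & B_{\mathcal{N}} \end{bmatrix} \qquad (2.2)$$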


(We are using the notation $0_{i \times j}$ for the $i$ by $j$ matrix whose entries
are all zero.) Essentially, $\mathcal{Q}$ consists of two disjoint HMMs,
$\mathcal{M}$ and $\mathcal{N}$, which have been concatenated together as in
Figure 2.3. Let $p_{\mathcal{Q}} = [p, 0_{\mathcal{N}}]$ be a prior on $\mathcal{Q}$
such that it equals the prior $p$ on the states corresponding to $\mathcal{M}$ and is
zero on the states corresponding to $\mathcal{N}$. Also define
$q_{\mathcal{Q}} = [0_{\mathcal{M}}, q]$ similarly. Then, by construction, it must be
true for any $x \in \mathcal{O}^* \cup \{\epsilon\}$ that
$\Pr(x \mid \mathcal{M}, p) = \Pr(x \mid \mathcal{Q}, p_{\mathcal{Q}})$ and also
$\Pr(x \mid \mathcal{N}, q) = \Pr(x \mid \mathcal{Q}, q_{\mathcal{Q}})$. So
$\mathcal{M}(p) \Leftrightarrow \mathcal{N}(q)$ if and only if $p_{\mathcal{Q}}$ and
$q_{\mathcal{Q}}$ are equivalent priors for our new HMM $\mathcal{Q}$. Therefore, as
a corollary of the results from the previous section, we can check equivalence of two
initialized Hidden Markov Models in $O((n + m)^3 k)$ time if the models have $n$ and
$m$ states respectively and share an output set of size $k$.
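The construction of $\mathcal{Q}$ can be sketched in Python as below. The sketch assumes the same conventions as before (column-stochastic `A`, `B[o][j]` the probability of output `o` in state `j`, $T(o) = A\,\mathrm{diag}(B[o])$, and $\Pr(x \mid \mathcal{M}, p) = \mathbf{1}^{\top} T(x)\,p$); the helper names and example models are hypothetical. It only builds the combined model and the padded priors; the equivalence test itself would then be the priors algorithm of Section 2.2 applied to $\mathcal{Q}$.

```python
from itertools import product

def combine(model_m, p, model_n, q):
    """Build Q from Equations 2.1 and 2.2, plus the padded priors p_Q and q_Q."""
    (A_m, B_m), (A_n, B_n) = model_m, model_n
    m, n = len(A_m), len(A_n)
    # A_Q is block diagonal: M and N never exchange probability mass.
    A_q = [list(row) + [0.0] * n for row in A_m] + \
          [[0.0] * m + list(row) for row in A_n]
    # B_Q places the output matrices side by side.
    B_q = [list(rm) + list(rn) for rm, rn in zip(B_m, B_n)]
    p_q = list(p) + [0.0] * n
    q_q = [0.0] * m + list(q)
    return (A_q, B_q), p_q, q_q

def string_prob(A, B, x, p):
    """Pr(x | model, p) = 1^T T(x) p, with T(o) = A * diag(B[o])."""
    v = list(p)
    for o in x:
        w = [B[o][j] * v[j] for j in range(len(v))]
        v = [sum(a * b for a, b in zip(row, w)) for row in A]
    return sum(v)

# Hypothetical example models with two output symbols.
A_m, B_m = [[0.5, 0.25], [0.5, 0.75]], [[0.9, 0.2], [0.1, 0.8]]
A_n, B_n = [[1.0]], [[0.3], [0.7]]
p, q = [0.6, 0.4], [1.0]
(A_q, B_q), p_q, q_q = combine((A_m, B_m), p, (A_n, B_n), q)

# Largest discrepancy between Q and the original models over short strings.
worst = max(
    max(abs(string_prob(A_q, B_q, x, p_q) - string_prob(A_m, B_m, x, p)),
        abs(string_prob(A_q, B_q, x, q_q) - string_prob(A_n, B_n, x, q)))
    for length in range(4) for x in product(range(2), repeat=length))
print(worst)
```

The discrepancy stays at rounding level because the zero blocks keep the two chains from interacting, which is the property the argument above relies on.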

In the next section we will investigate algorithms for deciding subset relations and
equivalence of Hidden Markov Models.


2.4  Equivalence  of  Hidden  Markov  Models 

In Chapter 1 we discussed the interpretation of HMMs as representations for classes
of stochastic processes, whose elements are derived by initializing prior
distributions on the models. Definition 2.4 defined an HMM $\mathcal{N}$ to be a
subset of an HMM $\mathcal{M}$ ($\mathcal{N} \subseteq \mathcal{M}$) when every
process that can be represented by $\mathcal{N}$ can also be represented by
$\mathcal{M}$. Equivalence of Hidden Markov Models was defined by saying
$\mathcal{M} \Leftrightarrow \mathcal{N}$ exactly when
$\mathcal{M} \subseteq \mathcal{N}$ and $\mathcal{N} \subseteq \mathcal{M}$. This
definition partitions HMMs into disjoint equivalence classes that are representations
for the same sets of stochastic processes. (This does not, of course, partition the
stochastic processes representable by HMMs into disjoint classes since a given
process may be representable by non-equivalent HMMs.) Our goal in the next chapter
will be to find a way of generating a minimal, canonical representative of each
equivalence class in order to isolate the essential expressive degrees of freedom in
an HMM. Producing such canonical representations will also reduce the computational
overhead involved in the use of large models. As a prelude, in this section, we will
develop an algorithm that will check whether two models $\mathcal{M}$ and
$\mathcal{N}$ are in a subset relation to each other. A corollary will let us check
equivalence of Hidden Markov Models. We will build up to the algorithm and the
associated characterization of equivalent HMMs by proving a series of lemmas.

To test equivalence of two Initialized HMMs, $\mathcal{M}$ and $\mathcal{N}$, we
first construct a larger HMM $\mathcal{Q}$, which contains $\mathcal{M}$ and
$\mathcal{N}$ as disjoint internal chains. If $p$ and $q$ are the fixed priors on
$\mathcal{M}$ and $\mathcal{N}$ respectively, checking equivalence of the priors
$(p, 0)$ and $(0, q)$ for the model $\mathcal{Q}$ should check that $\mathcal{M}$ and
$\mathcal{N}$ are equivalent Initialized HMMs.

Figure 2-3: Checking Equivalence of Initialized HMMs

Let $\mathcal{M}_1 = (\mathcal{S}_1, \mathcal{O}, A_1, B_1)$ and
$\mathcal{M}_2 = (\mathcal{S}_2, \mathcal{O}, A_2, B_2)$ be two Hidden Markov Models.
From the definitions we see that $\mathcal{M}_2 \subseteq \mathcal{M}_1$ exactly when
for every prior $p_2$ on $\mathcal{M}_2$ we can find a prior $p_1$ on $\mathcal{M}_1$
that makes $\mathcal{M}_1(p_1) \Leftrightarrow \mathcal{M}_2(p_2)$. Using the
definition of equivalent Initialized HMMs (Definition 2.2) we can write this as: for
every prior $p_2$ on $\mathcal{M}_2$ there exists a prior $p_1$ on $\mathcal{M}_1$
such that for every $x \in \mathcal{O}^* \cup \{\epsilon\}$ we can write
$B_1 T_1(x)\,p_1 = B_2 T_2(x)\,p_2$. This implies the following lemma which
essentially says that there is a stochastic matrix that transforms the priors on one
machine into equivalent priors on the other.

Lemma  2.1  Transformation  of  Priors 

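If $\mathcal{M}_1 = (\mathcal{S}_1, \mathcal{O}, A_1, B_1)$ and
$\mathcal{M}_2 = (\mathcal{S}_2, \mathcal{O}, A_2, B_2)$ then
$\mathcal{M}_2 \subseteq \mathcal{M}_1$ if and only if there exists a stochastic
matrix^ $C$ such that $\forall x \in \mathcal{O}^* \cup \{\epsilon\}$,
$B_1 T_1(x)\,C = B_2 T_2(x)$.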


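Proof: First, suppose $\mathcal{M}_2 \subseteq \mathcal{M}_1$. Let $e_2(i)$ be a
prior on $\mathcal{M}_2$ with all its mass on state $s_i$. Let $p_1(i)$ be the
corresponding prior on $\mathcal{M}_1$ such that
$\forall x \in \mathcal{O}^* \cup \{\epsilon\}$,
$B_1 T_1(x)\,p_1(i) = B_2 T_2(x)\,e_2(i)$. Such a $p_1(i)$ exists by assumption of
$\mathcal{M}_2 \subseteq \mathcal{M}_1$. Let $C$ be a matrix whose $i$-th column is
$p_1(i)$. In other words, $C = [\,p_1(1) \mid p_1(2) \mid \cdots \mid p_1(n_2)\,]$
where $n_2$ is the number of states in $\mathcal{M}_2$. It is clear that any prior on
$\mathcal{M}_2$ can be written as $p_2 = \sum_{i=1}^{n_2} \beta_i\,e_2(i)$, where
$\beta_i$ is the $i$-th component of $p_2$, and that we will have:

$$\begin{aligned}
\forall x \in \mathcal{O}^* \cup \{\epsilon\}, \quad
B_2 T_2(x)\,p_2 &= B_2 T_2(x) \sum_{i=1}^{n_2} \beta_i\,e_2(i) \\
&= B_1 T_1(x) \sum_{i=1}^{n_2} \beta_i\,p_1(i) \\
&= B_1 T_1(x)\,C\,p_2
\end{aligned}$$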


^By "stochastic matrix" we mean a matrix whose entries are all non-negative and whose
columns sum to one.
Since this is true for any $p_2$ we can conclude that if
$\mathcal{M}_2 \subseteq \mathcal{M}_1$ then
$\forall x \in \mathcal{O}^* \cup \{\epsilon\}$ we can write
$B_1 T_1(x)\,C = B_2 T_2(x)$. Furthermore, by construction, $C$ is stochastic. To
prove the lemma in the other direction, suppose that the matrix $C$ exists and, for
any prior $p_2$ on $\mathcal{M}_2$, let $p_1 = C p_2$ be the corresponding prior on
$\mathcal{M}_1$. Then, by the definition of equivalence,
$\mathcal{M}_1(p_1) \Leftrightarrow \mathcal{M}_2(p_2)$ since
$\forall x \in \mathcal{O}^* \cup \{\epsilon\}$ we can write
$B_1 T_1(x)\,(C p_2) = B_2 T_2(x)\,p_2$. Since this is true for any $p_2$ we conclude
that $\mathcal{M}_2 \subseteq \mathcal{M}_1$. □

Lemma  2.1  is  not  a  sufficiently  powerful  characterization  of  equivalence  of  HMMs 
to  enable  us  to  construct  an  algorithm  to  check  equivalence.  Essentially,  we  want  to 
find  a  necessary  and  sufficient  condition  that  does  not  require  us  to  examine  every 
finite  prefix  of  outputs  of  a  process  in  order  to  check  the  equivalence  of  models.  Our 
previous  results  achieved  this  goal  by  examining  the  properties  of  various  vector  spaces 
and  checking  an  equivalence  condition  on  their  bases.  The  next  lemma  we  prove  will 
tell  us  how  to  find  such  a  vector  space  that  allows  us  to  relax  the  equivalence  condition 
in  Lemma  2.1.  In  order  to  do  this  we  need  to  introduce  a  little  additional  notation. 

Definition  2.5  Suffix  Matrix 

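Let $\mathcal{M} = (\mathcal{S}, \mathcal{O}, A, B)$ be an HMM. Define a suffix
matrix $\Sigma(x) = B\,T(x)$ for every $x \in \mathcal{O}^* \cup \{\epsilon\}$. So
$\Sigma(x)_{ij} = \Pr(\mathcal{M} \text{ emits } x o_i \mid \mathcal{M} \text{ started in state } s_j)$.
The name suffix matrix originates from the observation that if $z = xy$ is a string
with prefix $x$ and suffix $y$, then $\Sigma(z) = \Sigma(y)\,T(x)$. Suppose $y$ is
any string in $\mathcal{O}^*$. Then we can always write $y = x o_i$ where
$o_i \in \mathcal{O}$ and $x \in \mathcal{O}^* \cup \{\epsilon\}$. For any
$y = x o_i \in \mathcal{O}^*$ we will use the notation $\sigma(y)$ to mean the $i$-th
row of $\Sigma(x)$. The $j$-th component of $\sigma(y)$ satisfies the equation
$\sigma(y)_j = \Pr(\mathcal{M} \text{ emits } y \mid \mathcal{M} \text{ started in state } s_j)$.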
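As an illustration of Definition 2.5, the following Python sketch computes the row vector $\sigma(y)$ directly from $A$ and $B$. It assumes the conventions used in the earlier sketches (column-stochastic `A`, `B[o][j]` the probability of output `o` in state `j`, and $T(o) = A\,\mathrm{diag}(B[o])$); the function name and the example model are hypothetical.

```python
from itertools import product

def sigma(A, B, y):
    """Row vector sigma(y); entry j is Pr(emit y | started in state s_j).

    For y = x o_i this is row i of Sigma(x) = B T(x), where
    T(x) = T(o_t) ... T(o_1) and T(o) = A * diag(B[o]) (assumed convention).
    """
    *x, last = y
    n = len(A)
    row = list(B[last])
    # Multiply on the right by T of the later symbols first, ending with T(o_1).
    for o in reversed(x):
        row_A = [sum(row[i] * A[i][j] for i in range(n)) for j in range(n)]
        row = [row_A[j] * B[o][j] for j in range(n)]
    return row

# Hypothetical two-state, two-output model.
A = [[0.5, 0.25], [0.5, 0.75]]
B = [[0.9, 0.2], [0.1, 0.8]]
print(sigma(A, B, (0,)))      # row 0 of B
print(sigma(A, B, (0, 1)))    # emit o_0 and then o_1

# For each start state the probabilities of all length-2 strings sum to one.
totals = [sum(sigma(A, B, y)[j] for y in product(range(2), repeat=2))
          for j in range(2)]
print(totals)
```

Extending a string on the left by $o_j$ corresponds to one more right-multiplication by $T(o_j)$, that is $\sigma(o_j x) = \sigma(x)\,T(o_j)$, which is the form used in the proofs that follow.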

Lemma 2.1 implies that if $\mathcal{M}_2 \subseteq \mathcal{M}_1$, then linear
dependence amongst the rows of $\Sigma_1(x)$ implies dependence amongst the rows of
$\Sigma_2(x)$. This provides a clue that the key to equivalence of HMMs lies in
comparing the spaces spanned by the rows of the suffix matrix. Investigating this
idea leads to the following lemma.
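Lemma 2.2 Equivalence Condition

Let $\mathcal{M}_1 = (\mathcal{S}_1, \mathcal{O}, A_1, B_1)$ and
$\mathcal{M}_2 = (\mathcal{S}_2, \mathcal{O}, A_2, B_2)$ be two Hidden Markov Models.
Let $\mathcal{U}_1 = \{\sigma_1(y) : y \in \mathcal{O}^*\}$ be the set of all rows of
the suffix matrices of $\mathcal{M}_1$. Let
$V = \{\sigma_1(x_1), \sigma_1(x_2), \ldots, \sigma_1(x_{|V|})\}$ be a basis for
$\mathrm{Span}(\mathcal{U}_1)$. Then $\mathcal{M}_2 \subseteq \mathcal{M}_1$ if and
only if there exists a stochastic matrix $C$ that satisfies the following conditions:

$$\forall o_k \in \mathcal{O}, \quad \sigma_1(o_k)\,C = \sigma_2(o_k) \qquad (2.3)$$

$$\forall x_i \text{ such that } \sigma_1(x_i) \in V, \quad \sigma_1(x_i)\,C = \sigma_2(x_i) \qquad (2.4)$$

$$\forall o_j \in \mathcal{O} \text{ and } \forall \sigma_1(x) \in V, \quad \sigma_1(x)\,[\,T_1(o_j)\,C - C\,T_2(o_j)\,] = 0 \qquad (2.5)$$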

Prior to proving this lemma it will help to gain some intuition for what it means.
Remember that the matrix $C$ in Lemma 2.1 transforms priors on $\mathcal{M}_2$ into
priors on $\mathcal{M}_1$, and that the $j$-th component of $\sigma_1(x)$ is the
probability of emitting string $x$, having started in state $s_j$. Using these two
facts we can see that Equation 2.4 says that for any choice of priors on
$\mathcal{M}_2$ there is a prior on $\mathcal{M}_1$ such that the probability of
emitting a string $y$ is the same for both models if $\sigma_1(y)$ is in the basis
for $\mathrm{Span}(\mathcal{U}_1)$. Equation 2.3 says the same thing for all strings
of length one. We will eventually use these two facts in the base case of an
induction to prove the lemma. We will see that Equation 2.5 is a way of saying that
if $\sigma_1(x)\,C = \sigma_2(x)$ for some $x$ then this condition is also satisfied
for any string $y$ that is one symbol longer than $x$. We will use this as the
induction step in the proof below.

Proof: First we will prove that if $\mathcal{M}_2 \subseteq \mathcal{M}_1$ then
Equations 2.3 to 2.5 will be true. So suppose that
$\mathcal{M}_2 \subseteq \mathcal{M}_1$. Then by Lemma 2.1, there is a stochastic
matrix $C$ such that for every $x \in \mathcal{O}^* \cup \{\epsilon\}$, every row
$\sigma_1(x o_i)$ of $\Sigma_1(x)$ satisfies $\sigma_1(x o_i)\,C = \sigma_2(x o_i)$
where $\sigma_2(x o_i)$ is the corresponding row of $\Sigma_2(x)$. This at once makes
Equations 2.3 and 2.4 true. Then we turn to Equation 2.5. Let $x$ be any string in
$\mathcal{O}^* \cup \{\epsilon\}$ and let $y = o_i x$ be an $|x| + 1$ long string
with $x$ as a suffix. Then by assumption of $\mathcal{M}_2 \subseteq \mathcal{M}_1$,
and using the definition of the suffix matrix we can make the following series of
statements:

$$\begin{aligned}
\sigma_1(y)\,C &= \sigma_2(y) \\
\Rightarrow \quad \sigma_1(x)\,T_1(o_i)\,C &= \sigma_2(x)\,T_2(o_i) \\
\Rightarrow \quad \sigma_1(x)\,T_1(o_i)\,C &= \sigma_1(x)\,C\,T_2(o_i) \\
\Rightarrow \quad \sigma_1(x)\,[\,T_1(o_i)\,C - C\,T_2(o_i)\,] &= 0 \qquad (2.6)
\end{aligned}$$

The second equation is derived from the first from the definition of $\Sigma_1(x)$
and $\sigma(x)$. The third equation simply replaces $\sigma_2(x)$ by
$\sigma_1(x)\,C$ by assumption of $\mathcal{M}_2 \subseteq \mathcal{M}_1$ and
Lemma 2.1. Since Equation 2.6 holds for every $o_i \in \mathcal{O}$ and for every $x$
such that $\sigma_1(x) \in V$, we have proven the necessity of the lemma. Next we
will prove the lemma in the other direction. Suppose that a stochastic matrix $C$
satisfying the conditions of the lemma exists. Then, by Equation 2.3,
$\sigma_1(x)\,C = \sigma_2(x)$ for every string $x$ of length 1. Then assume that for
all $x$ of length less than or equal to $l$ we can write
$\sigma_1(x)\,C = \sigma_2(x)$. For any such $x$ ($|x| \le l$) we can write
$\sigma_1(x) = \sum_{i=1}^{|V|} d_i\,\sigma_1(x_i)$ for some choice of $d_i$, where
the $\sigma_1(x_i)$ are elements of the basis $V$. So, by the induction assumption,
and Equation 2.4, we can write:
$\sigma_2(x) = \sigma_1(x)\,C = \sum_{i=1}^{|V|} d_i\,\sigma_1(x_i)\,C = \sum_{i=1}^{|V|} d_i\,\sigma_2(x_i)$.
But this means that for every output $o_j$, we can use Equation 2.5 to write:

$$\sigma_1(x)\,T_1(o_j)\,C = \sum_{i=1}^{|V|} d_i\,\sigma_1(x_i)\,T_1(o_j)\,C \qquad (2.7)$$

$$= \sum_{i=1}^{|V|} d_i\,\sigma_1(x_i)\,C\,T_2(o_j) \qquad (2.8)$$

$$= \sum_{i=1}^{|V|} d_i\,\sigma_2(x_i)\,T_2(o_j) \qquad (2.9)$$

$$= \sigma_2(x)\,T_2(o_j) = \sigma_2(o_j x) \qquad (2.10)$$

We go from Equation 2.7 to Equation 2.8 by applying condition 2.5 of the lemma. The
next two lines simply substitute the expression for $\sigma_2(x)$ obtained from the
induction assumption. Since the left-hand side of Equation 2.7 is
$\sigma_1(o_j x)\,C$, the conclusion is that if $\sigma_1(x)\,C = \sigma_2(x)$ for
all strings $x$ of length less than or equal to $l$, then the same is true for
strings of length $l + 1$. This completes the induction and proves that for every
$x$, we can write $\sigma_1(x)\,C = \sigma_2(x)$, implying that
$\forall x \in \mathcal{O}^* \cup \{\epsilon\}$, $\Sigma_1(x)\,C = \Sigma_2(x)$. By
Lemma 2.1, this shows that $\mathcal{M}_2 \subseteq \mathcal{M}_1$. □
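The conditions of Lemma 2.2 are finite and can be checked mechanically for a candidate matrix. The Python sketch below verifies Equations 2.3 to 2.5 for a given $C$; it does not search for $C$. It assumes the conventions of the earlier sketches ($T(o) = A\,\mathrm{diag}(B[o])$, column-stochastic matrices, exact `Fraction` entries), and it generates the basis $V$ of suffix-matrix rows with a queue in the manner of Step 1 of Figure 2.2. All function names and example models are hypothetical.

```python
from collections import deque
from fractions import Fraction as F

def row_mat(r, M):
    """Row vector r times matrix M (a list of rows)."""
    return [sum(r[i] * M[i][j] for i in range(len(r))) for j in range(len(M[0]))]

def operators(A, B):
    """T(o) = A * diag(B[o]) for every output o (assumed convention)."""
    n = len(A)
    return [[[A[i][j] * b[j] for j in range(n)] for i in range(n)] for b in B]

def reduce_against(echelon, v):
    """Eliminate from v its components along the stored echelon rows."""
    r = list(v)
    for pivot, row in echelon:
        if r[pivot] != 0:
            c = r[pivot]
            r = [a - c * b for a, b in zip(r, row)]
    return r

def sigma(A, B, y):
    """sigma(y) = (row y[-1] of B) T(y[-2]) ... T(y[0])."""
    T, row = operators(A, B), list(B[y[-1]])
    for o in reversed(y[:-1]):
        row = row_mat(row, T[o])
    return row

def suffix_row_basis(A, B):
    """Return [(x_i, sigma(x_i))], a basis for the span of all suffix-matrix rows."""
    T = operators(A, B)
    queue = deque(((k,), list(B[k])) for k in range(len(B)))
    basis, echelon = [], []
    while queue and len(basis) < len(A):
        x, s = queue.popleft()
        r = reduce_against(echelon, s)
        if any(a != 0 for a in r):
            basis.append((x, s))
            pivot = next(i for i, a in enumerate(r) if a != 0)
            echelon.append((pivot, [a / r[pivot] for a in r]))
            for j, T_j in enumerate(T):       # sigma(o_j x) = sigma(x) T(o_j)
                queue.append(((j,) + x, row_mat(s, T_j)))
    return basis

def is_stochastic(C):
    """Non-negative entries and columns that sum to one."""
    return all(c >= 0 for row in C for c in row) and \
           all(sum(col) == 1 for col in zip(*C))

def satisfies_lemma_2_2(model_1, model_2, C):
    (A1, B1), (A2, B2) = model_1, model_2
    T1, T2 = operators(A1, B1), operators(A2, B2)
    if not is_stochastic(C):
        return False
    if [row_mat(b, C) for b in B1] != [list(b) for b in B2]:      # (2.3)
        return False
    for x, s1 in suffix_row_basis(A1, B1):
        if row_mat(s1, C) != sigma(A2, B2, x):                     # (2.4)
            return False
        for T1_j, T2_j in zip(T1, T2):                             # (2.5)
            if row_mat(row_mat(s1, T1_j), C) != row_mat(row_mat(s1, C), T2_j):
                return False
    return True

# Hypothetical example: a one-state model M1 and a two-state model M2 whose
# states share M1's output distribution, so C = [1 1] merges the two states.
M1 = ([[F(1)]], [[F(1, 3)], [F(2, 3)]])
M2 = ([[F(1, 2), F(1, 4)], [F(1, 2), F(3, 4)]],
      [[F(1, 3), F(1, 3)], [F(2, 3), F(2, 3)]])
M3 = (M2[0], [[F(1), F(0)], [F(0), F(1)]])
print(satisfies_lemma_2_2(M1, M2, [[1, 1]]))
print(satisfies_lemma_2_2(M1, M3, [[1, 1]]))
print(satisfies_lemma_2_2(M3, M3, [[1, 0], [0, 1]]))
```

Finding a suitable $C$, rather than checking a given one, requires handling the non-negativity constraints, which is why the discussion that follows turns to linear programming.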

Lemma 2.2 could be used to build a polynomial time algorithm for testing equivalence
of HMMs. Such an algorithm would begin by generating the basis $V$ in the lemma. We
would use the efficient basis-generation technique used in Step 1 of our algorithm
for checking equivalence of prior distributions. Then we would use linear
programming techniques to find a matrix $C$ satisfying the conditions of the lemma.^
$\mathcal{M}_2 \subseteq \mathcal{M}_1$ only if such a matrix is found. Since linear
programming problems can be solved quickly, such an algorithm would run in polynomial
time ([karmarkar84]). However, it is possible to do even better. Some recent results
in the theory of probabilistic automata ([tzeng]), that are achieved using methods
similar to ours, suggest that the following lemma should be true.

Lemma  2.3  All  C  matrices  are  equivalent 

Let $C_1$ and $C_2$ be any two stochastic matrices satisfying
$\sigma_1(x_i)\,C_1 = \sigma_1(x_i)\,C_2$ for every $\sigma_1(x_i) \in V$, where $V$
is the basis in Lemma 2.2. Then, for any string $x$ we can write
$\sigma_1(x)\,C_1 = \sigma_1(x)\,C_2$.

Proof: Suppose $x$ is any string. Then for some choice of $d_i$ we know that
$\sigma_1(x) = \sum_{i=1}^{|V|} d_i\,\sigma_1(x_i)$ where the $\sigma_1(x_i)$ are the
elements of the basis in Lemma 2.2. Then it is clear that
$\sigma_1(x)\,C_1 = \sum_{i=1}^{|V|} d_i\,\sigma_1(x_i)\,C_1 = \sum_{i=1}^{|V|} d_i\,\sigma_1(x_i)\,C_2 = \sigma_1(x)\,C_2$. □

Collecting  all  our  lemmas  together,  we  can  finally  state  our  theorem  characterizing 
equivalent  Hidden  Markov  Models. 

^We need to use linear programming rather than straightforward linear algebra because
the stochasticity constraints on $C$ involve inequalities.
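Theorem 2.3 Equivalence of HMMs

Let $\mathcal{M}_1 = (\mathcal{S}_1, \mathcal{O}, A_1, B_1)$ and
$\mathcal{M}_2 = (\mathcal{S}_2, \mathcal{O}, A_2, B_2)$ be two Hidden Markov Models.
Let $\mathcal{U}_1 = \{\sigma_1(y) : y \in \mathcal{O}^*\}$ be the set of all rows of
the suffix matrices of $\mathcal{M}_1$. Let
$V = \{\sigma_1(x_1), \sigma_1(x_2), \ldots, \sigma_1(x_{|V|})\}$ be a basis for
$\mathrm{Span}(\mathcal{U}_1)$. Then $\mathcal{M}_2 \subseteq \mathcal{M}_1$ if and
only if the following two conditions hold. (a) There exists a stochastic matrix $C$
such that for every $x_i$ satisfying $\sigma_1(x_i) \in V$ we can write
$\sigma_1(x_i)\,C = \sigma_2(x_i)$. (b) For any stochastic $C$ satisfying condition
(a), the following must be true:

$$\forall o_k \in \mathcal{O}, \quad \sigma_1(o_k)\,C = \sigma_2(o_k) \qquad (2.11)$$

$$\forall o_j \in \mathcal{O} \text{ and } \forall \sigma_1(x) \in V, \quad \sigma_1(x)\,[\,T_1(o_j)\,C - C\,T_2(o_j)\,] = 0 \qquad (2.12)$$

$\mathcal{M}_1 \Leftrightarrow \mathcal{M}_2$ if and only if
$\mathcal{M}_2 \subseteq \mathcal{M}_1$ and $\mathcal{M}_1 \subseteq \mathcal{M}_2$.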


Proof: The proof follows easily from Lemmas 2.2 and 2.3. Suppose conditions (a) and (b) of our theorem hold, and pick any $C$ satisfying them. This $C$ also satisfies the conditions of Lemma 2.2, so that $M_2 \subseteq M_1$. So conditions (a) and (b) are sufficient to guarantee that $M_2 \subseteq M_1$. Next we show that they are also necessary conditions. So suppose that $M_2 \subseteq M_1$. First notice that Equation 2.11 says that $\sigma_1(x)C = \sigma_2(x)$ for every string $x$ of length 1. Also remember from the proof of Lemma 2.2 that Equation 2.12 essentially says that if $\sigma_1(x) \in V$, then any string $y = o_j x$ satisfies the condition $\sigma_1(y)C = \sigma_2(y)$. Lemma 2.3 tells us that if $C_1$ and $C_2$ both satisfy condition (a), then $\sigma_1(x)C_1 = \sigma_1(x)C_2$ for any string $x$. So, if any $C$ satisfies condition (a) and the equations of condition (b), then every $C$ satisfying (a) also satisfies the equations of (b). By Lemma 2.2 there is a stochastic matrix $C$ satisfying condition (a) and Equations 2.11 and 2.12. Therefore, as discussed above, every $C$ fulfilling condition (a) also satisfies the equations of (b). This proves that (a) and (b) are necessary conditions for $M_2 \subseteq M_1$ to be true. We have already shown that they are sufficient conditions and so our proof of the theorem is complete. □

Algorithm: We can use Theorem 2.3 to develop a polynomial time algorithm to test equivalence of HMMs. We do this by first checking if $M_2 \subseteq M_1$ and then checking $M_1 \subseteq M_2$. So suppose we are trying to check that $M_2 \subseteq M_1$. The subset-checking algorithm starts by generating the basis of Theorem 2.3 using the method of Step 1 in the algorithm for determining equivalence of priors. It then tries to find a matrix $C$ satisfying the equivalence condition (a) for this basis. If no such matrix can be found, then $M_2 \not\subseteq M_1$. If a $C$ satisfying condition (a) is found, we check that it satisfies the equations of condition (b). If it passes this test, Lemma 2.2 tells us that $M_2 \subseteq M_1$; if it fails, Theorem 2.3 tells us that $M_2 \not\subseteq M_1$. We check $M_1 \subseteq M_2$ similarly and answer the question of equivalence appropriately. Correctness of this algorithm is immediate from the correctness of our earlier algorithm to determine equivalence of priors, and from Theorem 2.3.
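
The last step of the subset check, testing a candidate $C$ against Equations 2.11 and 2.12, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, models are passed as nested lists of exact fractions, $\sigma(o_k)$ is taken to be row $k$ of the output matrix, and the transition operator is assumed to have entries $T(o_k)_{ij} = A_{ij} B_{kj}$. Finding $C$ itself (the linear programming step) is not shown.

```python
from fractions import Fraction as F

def transition_operator(A, B, k):
    """T(o_k)[i][j] = A[i][j] * B[k][j] (assumed convention)."""
    n = len(A)
    return [[A[i][j] * B[k][j] for j in range(n)] for i in range(n)]

def row_times_matrix(v, M):
    """Row vector v times matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def satisfies_condition_b(A1, B1, A2, B2, basis, C):
    """Check Equations 2.11 and 2.12 for a candidate C of shape n1 x n2."""
    k = len(B1)
    # Equation 2.11: sigma_1(o) C = sigma_2(o), with sigma(o) a row of B.
    for o in range(k):
        if row_times_matrix(B1[o], C) != list(B2[o]):
            return False
    # Equation 2.12: sigma_1(x) T_1(o) C = sigma_1(x) C T_2(o) for basis rows.
    for sigma in basis:
        sigma_c = row_times_matrix(sigma, C)
        for o in range(k):
            lhs = row_times_matrix(
                row_times_matrix(sigma, transition_operator(A1, B1, o)), C)
            rhs = row_times_matrix(sigma_c, transition_operator(A2, B2, o))
            if lhs != rhs:
                return False
    return True

# Hypothetical example: M1 has two identical states, M2 has a single state.
half, third = F(1, 2), F(1, 3)
A1 = [[half, half], [half, half]]
B1 = [[third, third], [2 * third, 2 * third]]
A2 = [[F(1)]]
B2 = [[third], [2 * third]]
basis = [B1[0]]                 # Span(U_1) is one-dimensional here
C = [[half], [half]]

result_ok = satisfies_condition_b(A1, B1, A2, B2, basis, C)
result_bad = satisfies_condition_b(A1, B1, A2, [[half], [half]], basis, C)
```

Exact rational arithmetic is used so that the equality tests are meaningful; with floating point one would compare up to a tolerance instead.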

We will now compute the running time of the HMM equivalence algorithm, assuming unit cost arithmetic. First of all, it takes $O(k(n_1^2 + n_2^2))$ time to generate all the $T_1(o_i)$ and $T_2(o_i)$ matrices from the parameters of the HMMs. From our earlier analysis, the basis-finding algorithm takes worst-case time $O(n_1^3 k)$ when appropriately optimized. We also need to compute the $\sigma_2(x_i)$ corresponding to the $\sigma_1(x_i) \in V$. This can be done at the same time that the basis is generated, simply adding a factor of 2 to the cost. Once the basis is generated, finding a matrix $C$ satisfying condition (a) involves solving a system of $n_2|V|$ equations in $n_1 n_2$ variables, subject to $n_2 + n_1 n_2$ stochasticity constraints. Since the constraints involve only linear equalities and inequalities (the columns of $C$ sum to one and $\forall i,j\ C_{ij} \ge 0$) we can solve for $C$ using linear programming ([chvatalSO]). Karmarkar ([karmarkar84]) gives a worst-case $O(Ln^{3.5})$ time algorithm for linear programming where $n$ is the number of variables and $L$ is the size of the linear program in bits. (This is also competitive in practice with the simplex algorithm.) It is a somewhat sticky business to translate the bit complexity in terms of $L$ into a complexity in terms of the number of variables and equations in the linear program. In rough terms, if we are dealing with a fixed number of bits per number, we can say that $L$ is of the order of $O(mn)$, where $mn$ is roughly the size of the linear programming tableau. Using this, we conclude that we can find $C$, if a solution exists, in worst-case time $O[(n_2|V| + n_2 + n_1 n_2)(n_1 n_2)^{4.5}] = O[(n_1 n_2)^{5.5}]$, where we have used the fact that $|V| \le n_1$. Once we have generated a matrix $C$, checking that it satisfies Equation 2.11 takes time $O(k n_1 n_2)$ and checking Equation 2.12 takes time $O[n_1 k(n_1^2 + n_2^2 + 2 n_1 n_2)]$. (Once again, we have used the fact that $|V| \le n_1$.)

Gathering all these terms together, and picking the dominant terms as $n_1$, $n_2$ and $k$ grow large, we find that our algorithm for checking $M_2 \subseteq M_1$ runs in worst-case time $O(n_1 k(n_1^2 + n_2^2 + 2 n_1 n_2) + (n_1 n_2)^{5.5})$. The complexity of checking $M_1 \subseteq M_2$ is obtained by exchanging $n_1$ and $n_2$ everywhere in this expression. The algorithm presented here can be optimized in various ways to do somewhat better, but these optimizations are less interesting and more complicated to explain.


Chapter  3 

Reduction  to  Canonical  Forms 


In  the  previous  chapter  we  defined  equivalence  of  stochastic  processes  and  proved  how 
and  why  prior  distributions  on  a  model  may  be  equivalent.  We  used  these  results  to 
characterize  equivalent  Initialized  Hidden  Markov  Models.  Finally,  we  made  various 
appeals  to  linear  algebraic  arguments  to  develop  necessary  and  sufficient  conditions 
for  the  equivalence  of  HMMs.  However,  our  results  concerning  equivalent  HMMs 
did  not  give  a  clear  intuitive  characterization  of  the  intrinsic  expressiveness  of  Hidden 
Markov  Models.  In  an  effort  to  achieve  such  a  characterization,  this  chapter  will  define 
the  canonical  dimension  of  a  model.  The  definition  is  related  to  our  formulation  of 
the  theorems  describing  equivalent  HMMs,  and  will  lead  quickly  to  an  algorithm  for 
finding  canonical  representations  of  models.  All  the  theorems  in  this  section  will 
be  proven  in  the  context  of  Generalized  Markov  Models  (GMMs)  which  relax  the 
positivity constraints on the parameters of HMMs. We will see that all processes that can be modelled exactly by Hidden Markov Models can also be modelled by Generalized Markov Models. Some kinds of GMMs, with appropriate restrictions placed on the allowable prior distributions, are equivalent to HMMs. In Section 3.2.1 we will see how the results achieved in this chapter should be modified to apply to HMMs. We begin by defining Generalized Markov Models and discussing their properties.

3.1  Generalized  Markov  Models

In  this  section  we  will  define  a  new  class  of  models  of  stochastic  processes.  Since  this 
new  class  contains  the  processes  modelled  by  traditional  Hidden  Markov  Models,  we 
will  christen  it  the  class  of  Generalized  Markov  Models.  Essentially,  the  generalization 
involves  relaxing  the  positivity  constraint  imposed  by  the  probabilistic  interpretation 
of  the  parameters  describing  the  underlying  Markov  Chain  of  an  HMM.  First  we  will 
discuss  why  such  a  generalization  may  be  a  good  idea,  and  then  we  will  proceed  to 
define  GMMs  and  describe  their  properties. 

3.1.1  Why  Should  We  Invent  GMMs? 

Empirical Reasons: Our first motivation for defining GMMs is empirical. L. Niles, in discussing the connections between stochastic classifiers and neural network schemes, describes experiments with an HMM-net, a network implementation of an HMM [niles90]. He reports that corrective training methods lead to HMM-net parameters that violate probability constraints, but are more successful in classification tasks. Niles points out that relaxing the stochasticity constraint on HMM parameters while preserving the formal structure^ results in a perfectly valid classifier and decision boundary model. Of course, the Bayesian formulation of classification is lost. However, Bayesian methods are only optimal if the true distributions are known, and this is very far from the case in most applications of HMMs. In light of these facts, Niles suggests that HMMs with “negative parameters” may be interesting because, in the HMM-net formulation of Hidden Markov Models, they have a natural interpretation as inhibitory connections. If we wish to follow this lead and investigate the properties of various HMM-like models we should be able to analytically compare the properties of the different schemes in order to be able to choose between them in a principled manner. This thesis initially arose from an attempt to understand the properties of HMMs sufficiently well to facilitate comparison with other classification schemes. The Generalized Markov Models we will define in this chapter are a natural generalization of HMMs which follow the empirical lead in [niles90] suggesting that “negative parameters” may be a good idea. We are able to describe, in detail, the connections between GMMs and HMMs.

^ By formal structure we mean, for example, the formal manipulations by which posterior probabilities are extracted from the model. Of course, once the model parameters cannot be interpreted as probabilities, we will be computing some non-probabilistic score.

Theoretical  Reasons:  We  are  also  motivated  to  define  Generalized  Markov  Models 
from  a  theoretical  perspective.  First  of  all,  we  will  take  the  view  that  an  HMM  is 
simply  an  iterative,  finite-state  scheme  used  to  represent  the  statistics  of  stochastic 
processes.  The  interpretation  of  the  model  parameters  as  probabilities  is  peripheral 
to  the  actual  goal  of  realizing  parsimonious  and  easily  manipulated  representations 
of  wide  classes  of  stochastic  processes.  Therefore,  there  is  no  intrinsic  reason  why 
the parameters of the model should be probabilities, unless we derive a clear benefit
from  the  constraints  imposed  by  such  an  interpretation.  If  we  discover  that  allowing 
negative  parameters  in  our  model  permits  us  to  build  better  models,  we  should  not 
allow  the  probabilistic  viewpoint  to  stop  us.  Secondly,  in  vague  terms,  all  the  results 
from  the  previous  chapter  dealt  with  general  linear  combinations  of  elements  of  vector 
spaces  as  opposed  to  convex  combinations  of  vectors  on  simplices.  (Probabilistic 
parameter  spaces  normally  lead  to  the  latter  situation.)  It  seems  natural,  therefore, 
to  ask  whether  it  is  really  necessary  for  the  parameters  of  an  HMM-like  model  to  be 
positive in order to successfully model stochastic processes. For example, we may be able to define a prior with “negative” parameters, without changing the probability distributions over outputs that we care about. Suppose $p$ is a prior on a model $M$, and $\mathcal{I}$ is an invariant subspace of the null-space of the output matrix. Then we can remove the components of $p$ that lie in $\mathcal{I}$ and the resulting vector $p'$ will induce the same stochastic process on $M$. (See the theorems in Section 2.2.) Notice that $p'$ may have negative components, although it must still sum to one since the vectors in $\mathcal{I}$ necessarily sum to zero. Given this fact, define a valid prior to be any (not necessarily stochastic) vector that induces a valid stochastic process when it initializes
a  model.  Clearly,  from  the  above  discussion,  the  set  of  valid  priors  extends  beyond 
the  probability  simplex.  Extending  the  argument,  we  could  permit  the  columns  of  the 
transition  matrix  A  of  a  model  to  also  be  pseudo-stochastic.^  A  generalized  model, 
defined  by  relaxing  constraints  in  this  fashion,  has  the  potential  to  model  a  wider 
class  of  processes  with  the  same  number  of  states.  This  is  particularly  important 
in  pattern  recognition  applications  because  it  is  usually  far  from  clear  that  the  true 
model  of  the  system  is  a  probabilistic  function  of  a  Markov  Chain.  Typically,  the 
best  we  can  hope  for  is  to  approximate  the  statistics  of  a  process  as  closely  as  possible 
with  our  model.  Therefore,  a  more  expressive  formalism  could  intrinsically  provide  a 
better  model. 
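
The remark about priors with negative components can be checked numerically on a small model. The sketch below is an illustration under assumptions, not a construction taken from the thesis: the three-state model is hypothetical (states 1 and 2 are exact duplicates, so $v = (1, -1, 0)$ lies in the null-space of the output matrix and is annihilated by every transition operator), and the operator convention $T(o_k)_{ij} = A_{ij} B_{kj}$ is assumed. Rather than projecting out a component, it shifts a stochastic prior along $v$, which is enough to produce negative entries while leaving every string probability unchanged.

```python
from fractions import Fraction as F
from itertools import product

def transition_operator(A, B, k):
    """T(o_k)[i][j] = A[i][j] * B[k][j] (assumed convention)."""
    n = len(A)
    return [[A[i][j] * B[k][j] for j in range(n)] for i in range(n)]

def string_probability(A, B, prior, outputs):
    """Apply T(o) for each output in turn, then sum the resulting vector."""
    state = list(prior)
    n = len(state)
    for k in outputs:
        T = transition_operator(A, B, k)
        state = [sum(T[i][j] * state[j] for j in range(n)) for i in range(n)]
    return sum(state)

q, h, t = F(1, 4), F(1, 2), F(1, 3)
A = [[q, q, t],            # columns sum to one; columns 1 and 2 coincide
     [q, q, t],
     [h, h, t]]
B = [[h, h, q],            # columns sum to one; columns 1 and 2 coincide
     [h, h, 3 * q]]
v = [F(1), F(-1), F(0)]    # B v = 0 and T(o) v = 0 for both outputs

p = [F(1, 5), F(1, 5), F(3, 5)]               # an ordinary stochastic prior
p_shifted = [a - h * b for a, b in zip(p, v)] # (-3/10, 7/10, 3/5)

strings = [s for m in range(4) for s in product(range(2), repeat=m)]
same_process = all(
    string_probability(A, B, p, s) == string_probability(A, B, p_shifted, s)
    for s in strings)
```

Only strings up to length 3 are compared here; the argument in the text is what guarantees agreement for all lengths.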

Reasons  of  Parsimony:  The  final  reason  to  consider  Generalized  Markov  Models 
is basically an argument that a smaller model is usually better. As discussed in the
previous  paragraph,  we  would  like  to  have  more  expressive  formalisms  for  modelling 
stochastic  processes  since  we  are  typically  dealing  with  problems  of  approximating 
a  system.  However,  if  the  formalism  involves  too  many  degrees  of  freedom,  it  will 
suffer  from  the  curse  of  dimensionality  -  it  will  become  very  difficult  to  estimate 
the  values  of  the  model  parameters  from  the  sparse  data  that  is  typically  available. 
So  we  basically  want  to  “say  more  with  fewer  parameters”.  We  can  also  make  the 
computational  argument  that,  in  general,  the  more  parameters  we  have  to  manipulate, 
the  slower  all  our  algorithms  will  be.  At  the  same  time,  the  formal  methods  of 
manipulating  HMMs  are  so  easy,  intuitive  and  efficient  that  we  would  love  to  be 
able  to  keep  them.  The  Generalized  Markov  Models  defined  in  this  thesis  achieve 
both  these  goals  by  preserving  the  formal  structure  of  HMMs,  but  liberating  them 
from  constraints  that  limit  the  class  of  processes  a  given  number  of  parameters  could 
model. Essentially, we attempt to get more mileage from each parameter of a model by allowing it to range over a greater domain in a natural way. We will see, for example, that the smallest HMM equivalent to a given model may have more states than its smallest representation in the GMM formalism. This is our principal reason for defining GMMs.

^ We do not relax the stochastic constraints on the output matrix because this makes analysis considerably harder.

We can see from these arguments that it may be worthwhile to consider generalizations of HMMs as techniques for modelling stochastic processes, especially for
pattern  recognition  applications.  In  particular,  we  have  seen  that  it  may  be  a  good 
idea  to  relax  the  positivity  constraint  on  the  parameters  of  Hidden  Markov  Models. 
We  will  now  define  Generalized  Markov  Models  and  discuss  their  properties. 

3.1.2  Definition  of  GMMs 

Our  first  task  is  to  define  what  we  mean  by  “relaxing  the  positivity  constraint”  on 
probabilities.  To  this  end  we  make  the  following  definition  of  a  pseudo-stochastic 
vector:

Definition 3.1 Pseudo-probability and Pseudo-stochasticity

Define an n-dimensional vector $v$ to be pseudo-stochastic if each of its components is real and $\sum_i v_i = 1$. Each entry of such a vector is called a pseudo-probability. Pseudo-probabilities of alternative independent events add just like true probabilities. Also define a pseudo-stochastic matrix to be one whose columns are pseudo-stochastic vectors. A pseudo-Markov Chain is a Markov Chain whose transition matrix and prior distribution are both pseudo-stochastic. In the rest of this chapter we will frequently use the term “probability” even when we mean pseudo-probability. The usage will be obvious from the context.

We  will  define  GMMs  by  essentially  replacing  the  probabilities  describing  the  under¬ 
lying  Markov  Chain  of  an  HMM  with  pseudo-probabilities.  We  will  need  to  impose 
some additional constraints on allowable priors to ensure that the model describes valid stochastic processes.

Definition 3.2 Generalized Markov Models (GMMs)

A Generalized Markov Model is defined as a quadruple $M = (S, O, A, B)$ where $S$ is a set of $n$ states, $O$ is a discrete set of $k$ outputs and $B$ is a stochastic output matrix as in the definition of HMMs. Define an n-dimensional pseudo-probability vector $v$ to be possible for $M$ if the product $Bv$ is a stochastic vector. (In other words $v$ is possible if $B$ maps $v$ to a probability distribution over the outputs.) Also define an n-dimensional vector $u$ to be valid for $M$ if $u$ induces a valid stochastic process when $M$ is initialized by $u$ and evolved according to the formal rules specified in Chapter 1. We demand that all n-dimensional stochastic vectors be valid for $M$. The transition matrix $A$ of a GMM must then be a pseudo-stochastic matrix whose columns are valid vectors for $M$.
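
The notions of pseudo-stochastic and possible vectors in Definitions 3.1 and 3.2 translate directly into small predicates. The sketch below uses hypothetical function names and a hypothetical output matrix; it tests only possibility. The stronger notion of a valid vector concerns the whole evolved process and is not tested here.

```python
from fractions import Fraction as F

def is_pseudo_stochastic(v):
    """Real entries summing to one; individual entries may be negative."""
    return sum(v) == 1

def is_stochastic(v):
    """A true probability vector: non-negative entries summing to one."""
    return sum(v) == 1 and all(x >= 0 for x in v)

def is_pseudo_stochastic_matrix(M):
    """Every column of M must be pseudo-stochastic."""
    return all(is_pseudo_stochastic(col) for col in zip(*M))

def is_possible(B, v):
    """v is possible for a model with output matrix B if B v is stochastic."""
    Bv = [sum(b * x for b, x in zip(row, v)) for row in B]
    return is_pseudo_stochastic(v) and is_stochastic(Bv)

# Hypothetical output matrix with k = 2 outputs and n = 2 states.
B = [[F(1, 2), F(1, 4)],
     [F(1, 2), F(3, 4)]]

possible_example = is_possible(B, [F(-1), F(2)])      # B v = (0, 1)
impossible_example = is_possible(B, [F(-2), F(3)])    # B v = (-1/4, 5/4)
matrix_example = is_pseudo_stochastic_matrix([[F(3, 2), F(0)],
                                              [F(-1, 2), F(1)]])
```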

We  can  see  that  GMMs  are  very  similar  to  HMMs  except  that  the  underlying  chain 
is  a  pseudo-Markov  Chain.  By  this  definition,  every  HMM  is  structurally  a  GMM, 
but  in  the  GMM  formulation  we  would  be  permitted  to  initialize  the  model  with 
valid  priors  that  are  not  stochastic.  Definition  3.2  is  not  very  constructive  in  that  it 
does  not  characterize  what  the  valid  priors  on  a  model  look  like.  The  results  we  will 
arrive  at  in  this  chapter,  including  the  derivation  of  canonical  forms  for  GMMs,  do 
not  require  such  a  characterization.  We  will  return  to  this  sticky  issue  briefly  at  the 
end  of  the  section. 

GMM Evolution: We will evolve a GMM forward in time by treating pseudo-probabilities formally as if they are true probabilities. In particular, projection and transition operators are formally defined exactly as in Table 1.1. The only difference lies in the interpretation of the various quantities. The $(i,j)$th component of the transition operator $T(o_k)$ is now understood to be the pseudo-probability that the underlying chain will transition from state $s_j$ to $s_i$, weighted by the true probability of emitting $o_k$ in state $s_j$. All probabilities related to the states in an HMM are replaced by pseudo-probabilities in a GMM, but we still retain the true probability interpretation of distributions over outputs. The suffix matrix of Definition 2.5 will be important to us in our discussion of reduction of HMMs. For any string $x$, the suffix matrix is defined as $\Pi(x) = B\,T(x)$ where $B$ is the GMM output matrix and $T(x)$ is the GMM transition operator for string $x$. In the context of GMMs, $\Pi(x)_{ij}$ is the probability that the model emits the string $x o_i$, given a pseudo-probability of 1 that the model started in state $s_j$. The meaning of the vectors $\sigma(x)$ in Definition 2.5 is also appropriately modified. Henceforth, when we speak of transition operators, suffix matrices or any other quantity originally defined for HMMs in the context of GMMs, we will be referring to these objects interpreted as described above.
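
As an illustration of the evolution rules just described, the following sketch builds transition operators and suffix matrices for a small model. It assumes $T(o_k)_{ij} = A_{ij} B_{kj}$ and, for a string $x = x_1 \ldots x_m$, $T(x) = T(x_m) \cdots T(x_1)$; the function names and both example models are hypothetical. The second model has negative entries in its transition matrix, yet it yields the same suffix matrices as the ordinary HMM beside it on the strings tested.

```python
from fractions import Fraction as F
from itertools import product

def matmul(X, Y):
    """Plain matrix product of nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transition_operator(A, B, k):
    """T(o_k)[i][j] = A[i][j] * B[k][j] (assumed convention)."""
    n = len(A)
    return [[A[i][j] * B[k][j] for j in range(n)] for i in range(n)]

def suffix_matrix(A, B, x):
    """Pi(x) = B T(x), with T(x) = T(x_m) ... T(x_1)."""
    n = len(A)
    Tx = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for k in x:
        Tx = matmul(transition_operator(A, B, k), Tx)
    return matmul(B, Tx)

q, h, t = F(1, 4), F(1, 2), F(1, 3)
B = [[h, h, q],
     [h, h, 3 * q]]
A_hmm = [[q, q, t],
         [q, q, t],
         [h, h, t]]
A_gmm = [[-q, q, t],       # first column is pseudo-stochastic: -1/4 + 3/4 + 1/2
         [3 * q, q, t],
         [h, h, t]]

strings = [s for m in range(3) for s in product(range(2), repeat=m)]
same_suffix_matrices = all(
    suffix_matrix(A_gmm, B, x) == suffix_matrix(A_hmm, B, x) for x in strings)
```

The two transition matrices differ only by a multiple of $(1, -1, 0)$ in the first column, a direction that the output matrix maps to zero, which is why the observable quantities agree.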


3.1.3  Properties  of  GMMs 

The  most  important  observation  to  make  about  the  properties  of  Generalized  Markov 
Models  is  that  all  the  equivalence  results  of  the  previous  chapter  carry  over  with  only 
minor  modifications.  In  this  section  we  will  describe  these  modifications.  First  of 
all,  we  define  equivalence  of  GMMs  and  Initialized  GMMs  in  exactly  the  same  terms 
as for HMMs. Priors are equivalent if they induce the same stochastic process on a model, and initialized models are equivalent if they represent the same stochastic process. The essential difference is just that we will allow pseudo-stochastic priors and transition matrices. Then, Theorems 2.1 and 2.2 concerning equivalence of prior distributions on HMMs apply immediately to equivalence of pseudo-priors on GMMs.
We  can  see  this  is  the  case  because  the  proofs  of  these  theorems  rely  only  on  the  linear 
structure  of  the  model  and  do  not  depend  on  any  property  related  to  stochasticity. 
Consequently,  the  characterization  of  equivalent  Initialized  HMMs  applies  at  once 
to  Initialized  GMMs  also.  At  first  sight,  it  appears  to  be  a  little  more  difficult 
to  translate  the  theorems  concerning  equivalence  of  HMMs  into  the  GMM  context, 
because  they  appear  to  require  various  quantities  to  be  stochastic.  However,  a  more 
careful  examination  shows  they  only  depend  on  the  fact  that  stochastic  vectors  sum 
to one. The positivity of probabilities is not used anywhere. We will use this to state the following lemmas concerning equivalent GMMs. We will only sketch the proofs since they parallel those of Chapter 2 with minor modifications that the reader can easily see. As before, we will say that $M_2 \subseteq M_1$ if every stochastic process that can be generated by setting a pseudo-prior on $M_2$ can also be generated by $M_1$.

Lemma 3.1 Transformation of Pseudo-priors on GMMs

If $M_1 = (S_1, O, A_1, B_1)$ and $M_2 = (S_2, O, A_2, B_2)$ are GMMs, then $M_2 \subseteq M_1$ if and only if there exists a pseudo-stochastic matrix $C$ such that we can write $B_1 T_1(x) C = B_2 T_2(x)$ for every $x \in O^* \cup \{\epsilon\}$. Furthermore, suppose we know that $C'$ is a pseudo-stochastic matrix that transforms the stochastic priors $p$ on $M_2$ into equivalent valid priors $q$ on $M_1$. Then $C'$ transforms every valid prior on $M_2$ into an equivalent valid prior on $M_1$, so that $M_2 \subseteq M_1$.

Proof: The proof of the first part of Lemma 3.1 follows the proof of Lemma 2.1. Essentially, we consider pseudo-priors $e_2(i)$ with all the mass on a state $s_i$ of $M_2$. By the assumption that $M_2 \subseteq M_1$, there are equivalent pseudo-priors $p_1(i)$ on $M_1$. The $p_1(i)$ are necessarily valid for $M_1$ because they induce valid stochastic processes by assumption. The columns of the transformation matrix $C$, as in Lemma 2.1, will be set equal to the $p_1(i)$. The proof then exactly parallels that of Lemma 2.1. To prove the second part of the lemma, suppose that $C'$ transforms stochastic priors on $M_2$ into equivalent valid priors on $M_1$. Then, it transforms the $e_2(i)$ into pseudo-stochastic $p_{1i} = C' e_2(i)$ such that $\Pr(x \mid M_2, e_2(i)) = \Pr(x \mid M_1, p_{1i})$ for every string $x$. Next, observe that every valid prior $p_2$ on $M_2$ can be written as a linear combination of the stochastic unit priors $e_2(i)$: $p_2 = \sum_{i=1}^{n_2} a_i e_2(i)$. Consequently, we can write for every $x \in O^* \cup \{\epsilon\}$ that:

$$
\begin{aligned}
\Pr(x \mid M_2, p_2) &= \sum_{i=1}^{n_2} a_i \Pr(x \mid M_2, e_2(i)) \\
&= \sum_{i=1}^{n_2} a_i \Pr(x \mid M_1, C' e_2(i)) \\
&= \Pr\Big(x \,\Big|\, M_1, \sum_{i=1}^{n_2} a_i C' e_2(i)\Big) \\
&= \Pr(x \mid M_1, C' p_2) \qquad (3.1)
\end{aligned}
$$

This indicates that it is always true that $M_2(p_2) \cong M_1(C' p_2)$ so long as $p_2$ is a valid prior for $M_2$. Since we only assumed that $C'$ correctly transformed stochastic priors, this proves the second part of the lemma. □

The  second  part  of  the  lemma  essentially  says  that  we  get  equivalence  of  GMMs 
for  free  if  we  can  prove  that  the  stochastic  priors  on  a  pair  of  machines  can  be 
transformed  into  equivalent  pseudo-priors  on  each  other.  A  corollary  of  this  is  that 
equivalent  HMMs  are  also  equivalent  GMMs.  This  is  true  because  we  know  that  if 
$M_2$ and $M_1$ are HMMs, and $M_2 \subseteq M_1$, then we can transform stochastic priors on $M_2$ into equivalent priors on $M_1$ using the transformation matrix $C$ of Lemma 2.1. Therefore, Lemma 3.1 tells us that $C$ also transforms all valid priors on $M_2$ into valid priors on $M_1$, implying that $M_2 \subseteq M_1$ even when the models are treated as GMMs.

Finally,  we  turn  our  attention  to  Lemma  2.2  and  Theorem  2.3  which  proved  nec¬ 
essary  and  sufficient  conditions  for  the  equivalence  of  HMMs.  Using  Lemma  3.1, 
and  our  earlier  discussion  of  the  suffix  matrix  for  GMMs,  we  can  see  that  these 
results  can  be  applied  directly  in  the  GMM  context.  We  would  simply  need  to  re¬ 
quire  that  the  transformation  matrix  C  they  invoked  be  pseudo-stochastic  instead 
of  stochastic.  Having  convinced  ourselves  that  all  the  results  characterizing  equiva¬ 
lence  of  HMMs  carry  over  to  GMMs  also,  we  see  that  the  algorithms  developed  in 
Chapter  2  can  be  applied  to  GMMs  also.  We  only  need  to  modify  the  algorithm 
for checking equivalence of un-initialized HMMs by relaxing the stochasticity requirement on the transformation matrix $C$ that it solves for. This actually makes the algorithm more efficient since we now only need to solve a system of linear equalities rather than inequalities. (We no longer need the constraint that the entries of $C$ should be non-negative.) Standard methods for solving systems of linear equalities run in $O(m^3 + m^2 n)$ time where $n$ is the number of variables and $m$ is the number of equations [pressOO]. Repeating the analysis of the algorithm for determining equivalence of HMMs, we find that, in the worst case, we will need to solve $n_1 n_2$ equations in $n_1 n_2$ variables, subject to $n_2$ pseudo-stochasticity constraints. This would take time $O[(n_1 n_2)^3 + (n_1 n_2)^2 n_1 n_2] = O[(n_1 n_2)^3]$. We conclude that our algorithm for deciding $M_2 \subseteq M_1$, where $M_2$ and $M_1$ are GMMs, has a worst-case running time of $O(n_1 k(n_1^2 + n_2^2 + 2 n_1 n_2) + (n_1 n_2)^3)$. This is somewhat better than the running time achieved in the context of HMMs. As before, $M_1 \cong M_2$ is decided by checking that $M_2 \subseteq M_1$ and $M_1 \subseteq M_2$.
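
Because condition (a) reduces to linear equalities in the GMM setting, each column of $C$ can be solved for independently. The sketch below is one way this could be done with exact Gaussian elimination; it is an assumption-laden illustration rather than the procedure analysed above. The names `solve_linear` and `find_transformation` are hypothetical, the inputs are the basis rows $\sigma_1(x_i)$ and the matching rows $\sigma_2(x_i)$, and only one solution is returned (free variables are set to zero). Condition (b) must still be checked separately.

```python
from fractions import Fraction as F

def solve_linear(rows, rhs):
    """Return one exact solution x of rows @ x = rhs, or None if none exists."""
    m, n = len(rows), len(rows[0])
    aug = [[F(a) for a in r] + [F(b)] for r, b in zip(rows, rhs)]
    pivots, r = [], 0
    for c in range(n):
        if r == m:
            break
        p = next((i for i in range(r, m) if aug[i][c] != 0), None)
        if p is None:
            continue                      # no pivot in this column
        aug[r], aug[p] = aug[p], aug[r]
        piv = aug[r][c]
        aug[r] = [a / piv for a in aug[r]]
        for i in range(m):                # clear column c in every other row
            if i != r and aug[i][c] != 0:
                f = aug[i][c]
                aug[i] = [a - f * b for a, b in zip(aug[i], aug[r])]
        pivots.append(c)
        r += 1
    if any(aug[i][n] != 0 for i in range(r, m)):
        return None                       # a zero row with non-zero right side
    x = [F(0)] * n
    for i, c in enumerate(pivots):
        x[c] = aug[i][n]
    return x

def find_transformation(sigma1, sigma2):
    """Solve sigma1 C = sigma2 with unit column sums; C has shape n1 x n2."""
    n1, n2 = len(sigma1[0]), len(sigma2[0])
    rows = [list(s) for s in sigma1] + [[1] * n1]   # last row: column sum
    columns = []
    for j in range(n2):
        col = solve_linear(rows, [s[j] for s in sigma2] + [1])
        if col is None:
            return None
        columns.append(col)
    return [[columns[j][i] for j in range(n2)] for i in range(n1)]

# Hypothetical data: one basis row for a two-state M1, one-state M2.
third = F(1, 3)
C_found = find_transformation([[third, third]], [[third]])
C_none = find_transformation([[1, 0], [1, 0]], [[1], [2]])
```

Solving column by column keeps each system small (at most $|V| + 1$ equations in $n_1$ unknowns), at the cost of repeating the elimination $n_2$ times.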


Our  discussions  of  Generalized  Markov  Models  have  swept  an  important  issue 
under  a  definitional  rug.  Our  formulation  of  GMMs  is  not  satisfactory  since  it  does 
not  characterize  what  makes  a  given  pseudo-stochastic  vector  valid  for  a  given  model. 
Consequently,  the  definition  is  not  clear  about  exactly  what  forms  the  transition 
matrix  A  is  allowed  to  take.  Since  this  thesis  only  compares  GMMs  with  each  other, 
this  does  not  become  a  difficulty  for  us  -  we  will  always  work  with  models  that  are 
presumed  to  be  well-defined.  (Obviously,  some  such  models  exist  since  HMMs  are 
themselves  GMMs  with  priors  restricted  to  be  stochastic.)  However,  if  we  want  to 
build  GMMs  for  practical  applications  we  must  have  a  more  constructive  method 
of  evaluating  the  validity  of  pseudo-stochastic  vectors  for  a  given  model.  At  least 
partly  because  of  the  non-constructive  definition  of  GMMs,  we  have  not  discussed 
the  issue  of  parameter-estimation  and  training  of  these  models  from  data.  However, 
even  without  properly  understanding  the  nature  of  valid  vectors  for  GMMs,  we  can 
make  progress  towards  developing  training  algorithms.  Some  relevant  ideas  will  be 
presented in the next chapter.

3.2  Canonical  Dimensions  and  Forms

We will now define the canonical dimension of a GMM. This will be a measure that characterizes the essential degree of freedom available in the model. As described above, we will freely borrow from the notation defined in Chapters 1 and 2 to manipulate HMMs. Distributions over the outputs of the model will remain stochastic. However, “distributions” over the states of the model will be pseudo-stochastic. We will now use the suffix matrix (Definition 2.5) to define the canonical dimension of a Generalized Markov Model.

Definition 3.3 Canonical Dimension

Let $M$ be a Generalized Markov Model with suffix matrices $\Pi(x)$ for every $x \in O^* \cup \{\epsilon\}$ as in Definition 2.5. Also let $U = \{\sigma(y) : y \in O^*\}$ be the set of all rows of the suffix matrices of $M$ as in Lemma 2.2. We define the canonical dimension of $M$ ($d_M$) to be the dimension of the space spanned by the vectors in $U$. In other words, $d_M = \dim[\mathrm{Span}(U)]$.

In order to understand the meaning of the canonical dimension of a model, remember that if $\sigma(x) \in U$, then the $j$th component of $\sigma(x)$ is the probability that the model, started in state $s_j$, emits the string $x$. So, in some sense, the canonical dimension of a model captures the maximal degree of freedom we have to define different stochastic processes by setting up different valid prior distributions. Our definition is also motivated by the following easy result that equivalent GMMs must have the same canonical dimension.
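
The canonical dimension can be computed by growing a basis for $\mathrm{Span}(U)$ string by string. The sketch below is a standard breadth-first construction and is not claimed to be the Step 1 procedure referred to in Chapter 2. It assumes $T(o_k)_{ij} = A_{ij} B_{kj}$, that $\sigma(o_k)$ is row $k$ of $B$, and that $\sigma(o x) = \sigma(x) T(o)$; the function names and the example models are hypothetical.

```python
from collections import deque
from fractions import Fraction as F

def reduce_against(vec, echelon):
    """Clear the pivot column of each stored row from vec, in insertion order."""
    vec = list(vec)
    for pivot_col, row in echelon:
        if vec[pivot_col] != 0:
            factor = vec[pivot_col] / row[pivot_col]
            vec = [a - factor * b for a, b in zip(vec, row)]
    return vec

def canonical_dimension(A, B):
    """Dimension of the span of all suffix-matrix rows sigma(y)."""
    n, k = len(A), len(B)
    ops = [[[A[i][j] * B[o][j] for j in range(n)] for i in range(n)]
           for o in range(k)]
    echelon = []                                  # (pivot column, reduced row)
    queue = deque([F(b) for b in row] for row in B)
    while queue:
        sigma = queue.popleft()
        residual = reduce_against(sigma, echelon)
        pivot = next((c for c, a in enumerate(residual) if a != 0), None)
        if pivot is None:
            continue                              # already in the span
        echelon.append((pivot, residual))
        for T in ops:                             # enqueue sigma(o x)
            queue.append([sum(sigma[i] * T[i][j] for i in range(n))
                          for j in range(n)])
    return len(echelon)

q, h, t = F(1, 4), F(1, 2), F(1, 3)
# Three states, two of them duplicates: dimension 2 although n = 3.
dim_duplicate = canonical_dimension(
    [[q, q, t], [q, q, t], [h, h, t]],
    [[h, h, q], [h, h, 3 * q]])
# Two states with distinct output distributions: dimension 2.
dim_generic = canonical_dimension(
    [[h, t], [h, 2 * t]],
    [[h, q], [h, 3 * q]])
# Two states that emit identically: dimension 1.
dim_flat = canonical_dimension(
    [[h, t], [h, 2 * t]],
    [[h, h], [h, h]])
```

A vector found to be dependent is not expanded further, since its successors are combinations of successors of the basis rows already queued; this keeps the number of expansions at most $n k$.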

Theorem 3.1 Invariance of Canonical Dimensions

Let $M_1$ be a GMM with $n_1$ states and canonical dimension $d_1$. Let $M_2$ be any GMM with $n_2$ states that is equivalent to $M_1$. Let $d_2$ denote the canonical dimension of $M_2$. Then it must be the case that $d_1 = d_2$ and $n_2 \ge d_1$.

Proof: If $M_1 \cong M_2$, then $M_1 \subseteq M_2$ and $M_2 \subseteq M_1$. Suppose then that $M_2 \subseteq M_1$. Then, using Lemma 3.1, we can write:

$$\forall x \in O^* \cup \{\epsilon\} : \quad \sigma_1(x) C = \sigma_2(x) \qquad (3.2)$$

But we can expand $\sigma_1(x)$ in terms of a basis $\{\sigma_1(x_i)\}$ for the span of $U_1 = \{\sigma_1(x)\}$, to write:

$$\sigma_1(x) C = \sum_{i=1}^{d_1} b_i \sigma_1(x_i) C \qquad (3.3)$$

$$= \sum_{i=1}^{d_1} b_i \sigma_2(x_i) \qquad (3.4)$$

$$\Longrightarrow \quad \sigma_2(x) = \sum_{i=1}^{d_1} b_i \sigma_2(x_i) \qquad (3.5)$$

Equation 3.5 shows that the collection of vectors $\{\sigma_2(x_i)\}$ spans $\mathrm{Span}(U_2)$, so that $d_2 \le |\{\sigma_2(x_i)\}| = |\{\sigma_1(x_i)\}| = d_1$. Similarly, since $M_1 \subseteq M_2$ also, we can say that $d_1 \le d_2$, giving us the result that $d_1 = d_2$. Finally, notice that the canonical dimension of a model $M$ with $n$ states must be less than or equal to $n$, since the $\sigma(x)$ vectors for $M$ will have only $n$ components. So, if $M_2$ is equivalent to $M_1$, $n_2 \ge d_2 = d_1$. □


Theorem 3.1 tells us that we cannot build a GMM equivalent to $\mathcal{M}_1$ with fewer
than $d_1$ states. Next we want to show that if $\mathcal{M}_1$ has canonical dimension $d_1$ and $n_1$
states, where $n_1 > d_1$, then we can effectively construct an equivalent model $\mathcal{M}'$ with
only $d_1$ states. We will prove this by first demonstrating how a particular special
type of GMM can be reduced. We will then reduce every GMM to this special form,
thereby proving the desired result.

Lemma  3.2  Reduction  of  a  Special  Form 

Let $\mathcal{M} = (S, O, A, B)$ be a GMM with $n$ states. Let $\mathcal{I}$ be the largest subspace of the
null-space of $B$ that is invariant under each of the transition operators $T(o_k)$. Also
let $B_i$ and $T_i(x)$ denote the $i$th columns of $B$ and $T(x)$ respectively. Suppose that there
is a collection of coefficients $\{f_{jl}\}$ and an index $\alpha$, $1 \le \alpha < n$, such that:

$$\forall\, l \le \alpha: \quad B_l = \sum_{j=\alpha+1}^{n} f_{jl}\,B_j \tag{3.6}$$

$$\forall\, o_k \in O \text{ and } \forall\, l \le \alpha: \quad T_l(o_k) = \sum_{j=\alpha+1}^{n} f_{jl}\,T_j(o_k) + \Delta_l(o_k) \tag{3.7}$$

where $\Delta_l(o_k) \in \mathcal{I}$. We will call the states $\{s_1, s_2, \ldots, s_\alpha\}$ the dependent states of
$\mathcal{M}$, and $\{s_{\alpha+1}, s_{\alpha+2}, \ldots, s_n\}$ the independent states of $\mathcal{M}$. We can build a model
$\mathcal{M}' = (S', O, A', B')$ with $n' = n - \alpha$ states, such that $\mathcal{M}' \cong \mathcal{M}$, and $S'$ contains
only the independent states of $\mathcal{M}$.

Prior to proving the lemma it will help to have some intuitions for why it should
be true. The lemma basically says that a model can be reduced to a smaller size
if the output distributions are linearly dependent and the corresponding columns of
every $T(o_k)$ are dependent with the same coefficients. The basic idea of the proof
is to realize that passing through one of the states $s_l$ for $l \le \alpha$ is indistinguishable
from passing through the states $s_m$ for $m > \alpha$ with pseudo-probabilities weighted
according to the appropriate linear dependency coefficients.^3 (See Figure 3.2) We can
use this observation to redistribute the priors and the outgoing probabilities from each
state in such a way that the linearly dependent states are never visited and can be
thrown away. The proof below is simply a formalization of this idea.

Proof: In the following discussion we will adopt the convention that variables
indexing the states of $\mathcal{M}'$ will range over $\alpha + 1$ to $n$. Our proof will proceed in five
steps. First we will define $B'$ and $A'$. In the second step we will prove a useful

^3 This is true up to the vector $\Delta_l(o_k)$. However, $\Delta_l(o_k)$ lies in an invariant subspace of the
null-space of $B$. Consequently, it never contributes to distributions over the outputs, and can be
ignored.

The figure shows an HMM for which $B_1 = f_2 B_2 + f_3 B_3$ and the $T(o_k)$ satisfy Equation 3.7.
(We have suppressed the output distributions in the figure.) In order to
remove the dependent state $s_1$, we excise the transitions to $s_1$ and add them to the
transitions between the independent states weighted appropriately by $f_2$ and $f_3$. The
priors are redistributed in the same way. If we do this, observe that $s_1$ is never visited
and can be thrown away.

Figure 3-1: Reduction of A Special Form


invariance property of $A'$. Next we will define a pseudo-stochastic transformation of
the priors on $\mathcal{M}$ into priors on $\mathcal{M}'$. Then we will use the invariance property of $A'$ to
show that $\mathcal{M} \subseteq \mathcal{M}'$. Finally, we will demonstrate that $\mathcal{M}' \subseteq \mathcal{M}$. We will find it
convenient to define the following matrix:


$$
F = \begin{bmatrix}
f_{(\alpha+1)1} & f_{(\alpha+1)2} & \cdots & f_{(\alpha+1)\alpha} \\
f_{(\alpha+2)1} & f_{(\alpha+2)2} & \cdots & f_{(\alpha+2)\alpha} \\
\vdots & \vdots & & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{n\alpha}
\end{bmatrix} \tag{3.8}
$$

So $F$ is an $n' \times \alpha$ matrix whose components are the expansion coefficients assumed
in the lemma. Note that $F$ must be pseudo-stochastic, since all the vectors on both
sides of Equation 3.6 are stochastic and therefore sum to one. We are now ready to
construct the reduced model $\mathcal{M}'$.

First of all, we will take the new output matrix $B'$ to simply be the last $n' = n - \alpha$
columns of $B$. Our earlier intuitions concerning $A'$ said that the transitions to
dependent states should be redistributed according to the weights of the expansion
coefficients. Putting this idea into symbols gives:


$$A'_{ij} = A_{ij} + \sum_{l=1}^{\alpha} f_{il}\,A_{lj} \tag{3.9}$$

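We can use the $F$ matrix defined earlier to compactly write down the relationship
between $A$, $A'$, $B$ and $B'$:

$$A' = [F \mid I_{n' \times n'}]\; A \begin{bmatrix} 0_{\alpha \times n'} \\ I_{n' \times n'} \end{bmatrix} \tag{3.10}$$

$$B = B'\,[F \mid I_{n' \times n'}] \tag{3.11}$$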


$I_{n' \times n'}$ is the $n'$ by $n'$ identity matrix and $[F \mid I_{n' \times n'}]$ is the matrix consisting of $F$ and
$I_{n' \times n'}$ concatenated together. $0_{\alpha \times n'}$ is the $\alpha \times n'$ zero matrix. Now suppose that
$P(t, x_{t-1})$ is a vector such that $P_i(t, x_{t-1})$ is the pseudo-probability that the model
$\mathcal{M}$ emits the string $x_{t-1}$ and then enters the state $s_i$ at time $t$. (This is the
pseudo-distribution over states before seeing the output at time $t$.) Then suppose it is also
true that:

$$P'(t, x_{t-1}) = [F \mid I_{n' \times n'}]\,\bigl(P(t, x_{t-1}) + \delta\bigr) \tag{3.12}$$

where $\delta$ is a vector that lies in $\mathcal{I}$. We claim that if Equation 3.12 holds, then the
joint probability of the output at time $t$ and $x_{t-1}$ is the same for $\mathcal{M}$ and $\mathcal{M}'$.
Furthermore, regardless of the output at time $t$ it will be true that
$P'(t+1, x_t) = [F \mid I_{n' \times n'}]\,(P(t+1, x_t) + \delta')$, where $\delta'$ is a vector lying in $\mathcal{I}$, the
invariant subspace of the null-space of $B$. We can prove the first part of the claim by
observing, with the help of Equations 3.11 and 3.12, that:

$$
\begin{aligned}
B'P'(t, x_{t-1}) &= B'\,[F \mid I_{n' \times n'}]\,\bigl(P(t, x_{t-1}) + \delta\bigr) \\
&= B\,\bigl(P(t, x_{t-1}) + \delta\bigr) \\
&= B\,P(t, x_{t-1}) && (3.13)
\end{aligned}
$$

The last equation follows because $\delta$ is in the null-space of $B$. In order to prove
the second part of the claim we assume without loss of generality that $o(t) = o_k$
and evolve the model forward in time. In order to do this, note that the transition
operator $T'(o_k)$ can be written as:


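$$
\begin{aligned}
T'(o_k) &= A'B'_k && (3.14)\\
&= [F \mid I_{n' \times n'}]\,A
\begin{bmatrix} 0_{\alpha \times n'} \\ I_{n' \times n'} \end{bmatrix}
[\,0_{n' \times \alpha} \mid I_{n' \times n'}\,]\;B_k
\begin{bmatrix} 0_{\alpha \times n'} \\ I_{n' \times n'} \end{bmatrix} && (3.15)\\
&= [F \mid I_{n' \times n'}]\,A
\begin{bmatrix} 0_{\alpha \times \alpha} & 0_{\alpha \times n'} \\ 0_{n' \times \alpha} & I_{n' \times n'} \end{bmatrix}
B_k
\begin{bmatrix} 0_{\alpha \times n'} \\ I_{n' \times n'} \end{bmatrix} && (3.16)
\end{aligned}
$$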


where we have used the fact that $B'_k$ consists of the last $n'$ rows and columns of $B_k$.
We can simplify this a little further by using the notation $A_i$ for the $i$th column of $A$ to
write:


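$$
\begin{aligned}
A \begin{bmatrix} 0_{\alpha \times \alpha} & 0_{\alpha \times n'} \\ 0_{n' \times \alpha} & I_{n' \times n'} \end{bmatrix} B_k
&= [\,0_{n \times \alpha} \mid A_{\alpha+1} \mid A_{\alpha+2} \mid \cdots \mid A_n\,]\;B_k && (3.17)\\
&= [\,0_{n \times \alpha} \mid T_{\alpha+1}(o_k) \mid T_{\alpha+2}(o_k) \mid \cdots \mid T_n(o_k)\,] && (3.18)
\end{aligned}
$$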


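Using this we can conveniently compute $P'(t+1, x_t)$ as shown below. We will let $\delta$ and
$\delta'$ denote vectors in $\mathcal{I}$, the invariant subspace of the null-space of $B$. For compactness
of the equations we will also write $T_\alpha(o_k)$ for
$[\,0_{n \times \alpha} \mid T_{\alpha+1}(o_k) \mid T_{\alpha+2}(o_k) \mid \cdots \mid T_n(o_k)\,]$.

$$
\begin{aligned}
P'(t+1, x_t) &= T'(o_k)\,P'(t, x_{t-1}) && (3.19)\\
&= [F \mid I_{n' \times n'}]\;T_\alpha(o_k)
\begin{bmatrix} 0_{\alpha \times n'} \\ I_{n' \times n'} \end{bmatrix}
[F \mid I_{n' \times n'}]\,\bigl(P(t, x_{t-1}) + \delta\bigr) && (3.20)\\
&= [F \mid I_{n' \times n'}]\;T_\alpha(o_k)
\begin{bmatrix} 0_{\alpha \times \alpha} & 0_{\alpha \times n'} \\ F & I_{n' \times n'} \end{bmatrix}
\bigl(P(t, x_{t-1}) + \delta\bigr) && (3.21)
\end{aligned}
$$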

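We can now use the fact that the columns of $T(o_k)$ are linearly dependent according
to Equation 3.7 to write:

$$
\begin{aligned}
[\,0_{n \times \alpha} \mid T_{\alpha+1}(o_k) \mid T_{\alpha+2}(o_k) \mid \cdots \mid T_n(o_k)\,]
\begin{bmatrix} 0_{\alpha \times \alpha} & 0_{\alpha \times n'} \\ F & I_{n' \times n'} \end{bmatrix}
&= T(o_k) - [\,\Delta_1(o_k) \mid \Delta_2(o_k) \mid \cdots \mid \Delta_\alpha(o_k) \mid 0_{n \times n'}\,] && (3.22)\\
&= T(o_k) - \Delta && (3.23)
\end{aligned}
$$

(The sign in front of $\Delta$ is immaterial in what follows, since $\mathcal{I}$ is a subspace.)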

where we have set $\Delta = [\,\Delta_1(o_k) \mid \Delta_2(o_k) \mid \cdots \mid \Delta_\alpha(o_k) \mid 0_{n \times n'}\,]$. Observe that for any vector
$x$ of appropriate dimension, $\Delta x \in \mathcal{I}$ since every column of $\Delta$ is an element of $\mathcal{I}$, the
invariant null-space of $B$. Therefore, plugging Equation 3.23 into Equation 3.21, we
find that:

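$$
\begin{aligned}
P'(t+1, x_t) &= [F \mid I_{n' \times n'}]\,\bigl(T(o_k)\,P(t, x_{t-1}) + \delta'\bigr) && (3.24)\\
&= [F \mid I_{n' \times n'}]\,\bigl(P(t+1, x_t) + \delta'\bigr) && (3.25)
\end{aligned}
$$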

where $\delta'$ is some vector in $\mathcal{I}$.^4 Equation 3.25 shows us that if the pseudo-probabilities
on $\mathcal{M}'$ satisfy Equation 3.12 at time $t$, they do so also at time $t + 1$ and, by induction
on $t$, for all future times. This invariance property of $A'$ will be useful shortly in
proving that $\mathcal{M}'$ and $\mathcal{M}$ are equivalent.

^4 We get Equation 3.24 by using the facts that $T(o_k)\delta \in \mathcal{I}$ since $\delta \in \mathcal{I}$, and $\Delta x \in \mathcal{I}$ for any $x$ as
discussed before.

We are finally in a position to show that $\mathcal{M} \subseteq \mathcal{M}'$. If $p$ is a prior on $\mathcal{M}$,
let $p' = [F \mid I_{n' \times n'}]\,p$ be the corresponding prior on $\mathcal{M}'$. These prior distributions
cause Equation 3.12 to be satisfied for $t = 0$ and $x = \epsilon$. Therefore, by our earlier
discussion, Equation 3.12 is satisfied for all times $t$ and strings $x_t$. We also showed
that if Equation 3.12 is satisfied, then the two models have the same probabilities of
producing the various outputs. Hence, we can conclude that $\mathcal{M}'(p') \cong \mathcal{M}(p)$. Since
$[F \mid I_{n' \times n'}]$ is a pseudo-stochastic transformation of priors on $\mathcal{M}$ into equivalent priors
on $\mathcal{M}'$, we know that $\mathcal{M} \subseteq \mathcal{M}'$. To show that $\mathcal{M}' \subseteq \mathcal{M}$, we will first show that
every stochastic prior on $\mathcal{M}'$ can be transformed into an equivalent valid prior on $\mathcal{M}$.
Lemma 3.1 will then show that $\mathcal{M}' \subseteq \mathcal{M}$. So suppose that $q'$ is a stochastic prior on
$\mathcal{M}'$. Then, construct a prior $q$ on $\mathcal{M}$ such that:


$$q = \begin{bmatrix} 0_{\alpha \times n'} \\ I_{n' \times n'} \end{bmatrix} q' = (0_\alpha,\; q') \tag{3.26}$$


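where $0_\alpha$ is the $\alpha$-dimensional zero vector. We can see at once that
$q' = [F \mid I_{n' \times n'}]\,q$, so that Equation 3.12 holds at the initial time with $\delta = 0$.
Therefore, our earlier discussion shows that $\mathcal{M}'(q') \cong \mathcal{M}(q)$. So Equation 3.26
defines a pseudo-stochastic transformation of stochastic priors on $\mathcal{M}'$ into equivalent
valid priors on $\mathcal{M}$. By Lemma 3.1 we can conclude that $\mathcal{M}' \subseteq \mathcal{M}$. Putting
everything together we finally reach the desired conclusion that $\mathcal{M}' \cong \mathcal{M}$. □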
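
The construction in the proof of Lemma 3.2 can be checked numerically on a toy model.
The Python sketch below is not part of the proof; it assumes the conventions
$T(o) = A\,\mathrm{diag}(B_o)$, `A[i][j]` for the weight of moving from state j to state i, and
`B[o][j]` for the weight of output o in state j. States are numbered from 0, so the
dependent states are 0, ..., alpha - 1. All function names and the example model are
hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product


def reduce_special_form(A, B, F, alpha):
    """Build (A', B') as in Equation 3.9; states 0..alpha-1 are the dependent ones."""
    n = len(A)
    keep = range(alpha, n)
    A_red = [[A[i][j] + sum(F[i - alpha][l] * A[l][j] for l in range(alpha))
              for j in keep] for i in keep]
    B_red = [[row[j] for j in keep] for row in B]   # last n' columns of B
    return A_red, B_red


def map_prior(p, F, alpha):
    """p' = [F | I] p: weight on dependent states is moved to independent states."""
    return [p[i] + sum(F[i - alpha][l] * p[l] for l in range(alpha))
            for i in range(alpha, len(p))]


def string_probability(A, B, prior, outputs):
    """Pseudo-probability of emitting `outputs`, with T(o) = A * diag(B[o])."""
    n = len(A)
    v = list(prior)
    for o in outputs:
        w = [B[o][j] * v[j] for j in range(n)]
        v = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    return sum(v)


# Hypothetical 3-state, 2-output model in which state 0 depends on states 1 and 2.
a = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]
A = [[a[i]] * 3 for i in range(3)]                 # every column of A equals a
B = [[Fr(1, 2), Fr(3, 4), Fr(1, 4)],
     [Fr(1, 2), Fr(1, 4), Fr(3, 4)]]               # B_0 = (B_1 + B_2) / 2
F = [[Fr(1, 2)], [Fr(1, 2)]]
A_red, B_red = reduce_special_form(A, B, F, alpha=1)
p = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]
p_red = map_prior(p, F, alpha=1)
agree = all(string_probability(A, B, p, x) == string_probability(A_red, B_red, p_red, x)
            for m in range(1, 5) for x in product(range(2), repeat=m))
print(A_red, p_red, agree)
```

In this example the conditions of the lemma hold with $\Delta_l(o_k) = 0$, and the two-state
model assigns the same pseudo-probability as the three-state model to every output
string that the check enumerates.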


All that remains in our quest to find minimal representations for GMMs is a
way of transforming all reducible GMMs into the special form that was reduced in
Lemma 3.2. We will now prove a theorem that shows that all reducible GMMs are
already in the special form of Lemma 3.2. This is then used to reduce GMMs to their
minimal equivalent representations.

Theorem  3.2  Reduction  of  GMMs  to  Minimal  Representations 
Let $\mathcal{M}_1 = (S_1, O, A_1, B_1)$ be a GMM with $n_1$ states and canonical dimension $d_1 < n_1$.
Then $\mathcal{M}_1$ can be reduced to a minimal equivalent model with only $d_1$ states. If a
model has only as many states as its canonical dimension, we will call it a minimal
representation for its equivalence class.


Proof: We defined the canonical dimension of $\mathcal{M}_1$ to be the dimension of the span
of $U_1 = \{\vec{\sigma}_1(y) : y \in O^*\}$, where the $\vec{\sigma}_1(y)$ are rows of the suffix matrices of $\mathcal{M}_1$. Let
$V = \{\vec{\sigma}_1(x_1), \vec{\sigma}_1(x_2), \ldots, \vec{\sigma}_1(x_{d_1})\}$ be a collection of vectors in $U_1$ that forms a basis
for $\mathrm{Span}(U_1)$. Then consider a matrix $G$ whose rows are the elements of $V$. We can
write:


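$$
G = \begin{bmatrix} \vec{\sigma}_1(x_1) \\ \vec{\sigma}_1(x_2) \\ \vdots \\ \vec{\sigma}_1(x_{d_1}) \end{bmatrix}
= [\,g_1 \mid g_2 \mid \cdots \mid g_{n_1}\,] \tag{3.27}
$$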


(In this equation the vectors $g_i$ represent the columns of $G$.) $G$ is a $d_1 \times n_1$ matrix
whose rows are linearly independent. So, it has a row-rank $d_1$ and this means that
its column rank is also $d_1$. So, there are only $d_1$ independent columns in $G$. Assume,
without loss of generality, that the last $d_1$ columns of $G$ are the independent columns
and let $\alpha = n_1 - d_1$. There must be a set of coefficients $\{f_{jl}\}$ such that we can write:

$$\forall\, l \le \alpha: \quad g_l = \sum_{j=\alpha+1}^{n_1} f_{jl}\,g_j \tag{3.28}$$

We are going to use this fact to show $\mathcal{M}_1$ already satisfies the conditions of Lemma 3.2
and can therefore be reduced to a smaller size. In order to do this we will find it
convenient to introduce the following matrix:


$$
F = \begin{bmatrix}
f_{(\alpha+1)1} & f_{(\alpha+1)2} & \cdots & f_{(\alpha+1)\alpha} \\
f_{(\alpha+2)1} & f_{(\alpha+2)2} & \cdots & f_{(\alpha+2)\alpha} \\
\vdots & \vdots & & \vdots \\
f_{n_1 1} & f_{n_1 2} & \cdots & f_{n_1 \alpha}
\end{bmatrix} \tag{3.29}
$$

This matrix is formally the same as the $F$ matrix used in the proof of the special case
reduction lemma. We will see that the similarity is not coincidental. We can use the
$F$ matrix to rewrite Equation 3.28 more compactly as follows:


$$G \begin{bmatrix} -I_{\alpha \times \alpha} \\ F \end{bmatrix} = 0 \tag{3.30}$$


Remember now that every row of every suffix matrix $\Sigma_1(x)$ can be written as a linear
combination of the rows of $G$. This implies that corresponding to every matrix $\Sigma_1(x)$,
there is another matrix $S(x)$ such that $\Sigma_1(x) = S(x)G$. (The $i$th row of $S(x)$ contains
the coefficients expressing the $i$th row of $\Sigma_1(x)$ as a linear combination of the rows of
$G$.) Using this we find that:


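$$
\forall x \in O^* \cup \{\epsilon\}: \quad
\Sigma_1(x) \begin{bmatrix} -I_{\alpha \times \alpha} \\ F \end{bmatrix}
= S(x)\,G \begin{bmatrix} -I_{\alpha \times \alpha} \\ F \end{bmatrix} = 0 \tag{3.31}
$$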


By picking $x = \epsilon$ so that $\Sigma_1(x) = B_1$, and expanding the matrix notation into a
summation, we find that:

$$\forall\, l \le \alpha: \quad B_l = \sum_{j=\alpha+1}^{n_1} f_{jl}\,B_j \tag{3.32}$$


where $B_i$ is the $i$th column of $B_1$. Notice that this is exactly the first condition
we need in order to apply our earlier lemma on reduction of certain special types of
GMMs. Next, for notational convenience, we define $\Delta(o_k)$ such that:


$$\Delta(o_k) = T_1(o_k) \begin{bmatrix} -I_{\alpha \times \alpha} \\ F \end{bmatrix} \tag{3.33}$$


We will refer to the columns of $\Delta(o_k)$ and $T_1(o_k)$ as $\Delta_i(o_k)$ and $T_i(o_k)$ respectively.
Then, for any string $y = o_k x$ which starts with the output $o_k$ we can write
Equation 3.31 as:

$$B_1\,T_1(x)\,\Delta(o_k) = 0 \tag{3.34}$$

Since this equality holds for every string $x \in O^* \cup \{\epsilon\}$, we can conclude that the
columns of $\Delta(o_k)$ are elements of $\mathcal{I}$, the invariant null-space of $B_1$. By expanding the
definition of $\Delta(o_k)$ we then find that:

$$\forall\, o_k \in O \text{ and } \forall\, l \le \alpha: \quad T_l(o_k) = \sum_{j=\alpha+1}^{n_1} f_{jl}\,T_j(o_k) - \Delta_l(o_k) \tag{3.35}$$

where the $\Delta_l(o_k) \in \mathcal{I}$ are the columns of $\Delta(o_k)$. Since $\mathcal{I}$ is a subspace, $-\Delta_l(o_k)$
also lies in $\mathcal{I}$, so Equations 3.32 and 3.35 are exactly the conditions that make
Lemma 3.2 true. Consequently, any GMM with canonical dimension $d_1$ has only $d_1$
independent states. The method outlined in the proof of Lemma 3.2 can then be used
to reduce $\mathcal{M}_1$ to a model $\mathcal{M}^*$ with only $d_1$ states. Since Theorem 3.1 tells us that no
smaller model can be equivalent to $\mathcal{M}_1$, $\mathcal{M}^*$ is a minimal representation of $\mathcal{M}_1$. □

Theorem 3.2 shows how a GMM can be reduced to a minimal representation. We
will discuss how this result applies to Hidden Markov Models in Section 3.2.1. In
addition to finding minimal models, we also want our representations to be “canonical”
in the sense that they are essentially unique. Next, we will prove two theorems that
provide a deeper understanding of the essential reasons for reducibility of GMMs, and
characterize the relationship between equivalent minimal representations of a given
model $\mathcal{M}$.

Theorem 3.3 Geometric Characterization of Minimal Representations
As before, we will call a model a minimal representation if it is the smallest
model in its equivalence class. A model is minimal if and only if its invariant
null-space $\mathcal{I}$ consists of only the zero vector. Hence, priors are equivalent for a minimal
representation only if they are equal to each other.

Proof: Let $\mathcal{M} = (S, O, A, B)$ be a Generalized Markov Model with $n$ states. We
remind the reader that the invariant null-space $\mathcal{I}$ is the largest subspace of the
null-space of the output matrix $B$ that is invariant under the action of every transition
operator $T(o_k)$. Suppose, first of all, that $\mathcal{M}$ is a minimal representation. Suppose
also that there is a vector $\delta \in \mathcal{I}$ which has some non-zero components. By definition
of being an element of $\mathcal{I}$ we can write:

$$\forall x \in O^* \cup \{\epsilon\}: \quad B\,T(x)\,\delta = \Sigma(x)\,\delta = 0 \tag{3.36}$$

By picking $x = \epsilon$ and $x = o_k y$ where $y$ is any string we can write:

$$B\,\delta = 0 \tag{3.37}$$

$$\forall y \in O^* \cup \{\epsilon\}: \quad B\,T(y)\,[T(o_k)\,\delta] = B\,T(y)\,\Delta(o_k) = 0 \tag{3.38}$$

where we have written $\Delta(o_k)$ for $T(o_k)\delta$. The second equation says that $\Delta(o_k) \in \mathcal{I}$,
the invariant null-space of $B$. Writing this out as an equation for the columns of
$B$ and $T(o_k)$, and assuming, without loss of generality, that $\delta_1 \ne 0$, we find that:

$$B_1 = -\sum_{j=2}^{n} \frac{\delta_j}{\delta_1}\,B_j \tag{3.39}$$

$$\forall\, o_k \in O: \quad T_1(o_k) = -\sum_{j=2}^{n} \frac{\delta_j}{\delta_1}\,T_j(o_k) + \frac{1}{\delta_1}\,\Delta(o_k) \tag{3.40}$$

(As before, we are writing $B_i$ and $T_i(o_k)$ for the columns of $B$ and $T(o_k)$
respectively.) But this means that $s_1$ is a dependent state, in the sense of Lemma 3.2,
and can be reduced away. This contradicts the assumed minimality of the model.
So we see that if $\mathcal{M}$ is a minimal representation, then $\mathcal{I}$ can consist only of the zero
vector. Next we will prove that if $\mathcal{I} = \{0\}$, then the model is necessarily minimal. So
assume that $\mathcal{I} = \{0\}$. Then suppose that $\mathcal{M}$ is not minimal and therefore $n > d_{\mathcal{M}}$,
where $d_{\mathcal{M}}$ is the canonical dimension of $\mathcal{M}$. Theorem 3.2 then tells us that there is
a collection of coefficients $\{f_{jl}\}$, not all of which are zero, such that:


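$$\forall x \in O^* \cup \{\epsilon\}: \quad \Sigma(x) \begin{bmatrix} -I_{\alpha \times \alpha} \\ F \end{bmatrix} = 0 \tag{3.41}$$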


where $F$ is defined by Equation 3.29. Let $\delta$ be any column of the matrix to the right
of $\Sigma(x)$ in Equation 3.41. Then $\delta$ is a vector with some non-zero components that
lies in $\mathcal{I}$.^5 This contradicts our assumption about $\mathcal{I}$, telling us that if $\mathcal{I}$ consists only
of the zero vector, the model cannot have more states than the canonical dimension,
and is, therefore, minimal. So we have proved that for a model $\mathcal{M}$ to be a minimal
representation, it is necessary and sufficient that its invariant null-space consists only
of the zero vector. Observe that, according to the GMM version of Theorem 2.1, this
implies that equivalent priors on a minimal representation are equal to each other. □


^5 This is so because for every string $x$ we know that $\Sigma(x)\delta = BT(x)\delta = 0$.

Theorems 3.1 and 3.2 told us that this minimal model has exactly as many states
as its canonical dimension. The result proven just above showed that a minimal model
can be characterized geometrically as having an invariant null-space consisting only of
the zero vector. Furthermore, the invariant null-space of a model with $n$ states has a
dimension $n - d_{\mathcal{M}}$ where $d_{\mathcal{M}}$ is the canonical dimension of the model. One consequence
of this is that no two unequal priors on a minimal model are equivalent. In other
words, equivalence of priors on a minimal model implies equality of priors. This tells
us that the minimal representation indeed removes every last shred of redundancy
available in a model. Every stochastic process that can be modelled by setting the
priors on the machine is represented precisely once, by a distinct prior. We could use
this to build an algorithm to reduce a model to its minimal representation. First of all,
we would find the invariant null-space $\mathcal{I}$ via standard methods for decomposing vector
spaces based on their invariance properties under different operators. Then we would
find a basis for $\mathcal{I}$, and use the basis vectors as shown in the proof of Theorem 3.3 as
the linear dependency coefficients required by the reduction lemma. However, we can
build a cleaner algorithm directly from Theorem 3.2. We will do this after proving
one more theorem which characterizes the relationship between equivalent minimal
representations in the GMM formalism for a class of stochastic processes.
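
To make the dimension count $n - d_{\mathcal{M}}$ concrete, a small worked example may help. The
model below is hypothetical and chosen only for illustration; it assumes the convention
$T(o_k) = A\,\mathrm{diag}(\beta_k)$, where $\beta_k$ denotes the row of $B$ for output $o_k$.

```latex
% Hypothetical 3-state, 2-output model: every column of A equals the vector a.
\[
A = [\,a \mid a \mid a\,], \qquad
a = \begin{bmatrix} 1/2 \\ 1/4 \\ 1/4 \end{bmatrix}, \qquad
B = \begin{bmatrix} 1/2 & 3/4 & 1/4 \\ 1/2 & 1/4 & 3/4 \end{bmatrix}
  = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}.
\]
% Under the assumed convention T(o_k) = A diag(beta_k) = a beta_k, so v T(o_k) = (v . a) beta_k:
% every row of every suffix matrix is a multiple of a row of B.
\[
\mathrm{Span}(U) = \mathrm{Span}\{\beta_1, \beta_2\}
\quad\Longrightarrow\quad d_{\mathcal{M}} = 2 .
\]
% The null-space of B is one-dimensional and is invariant under every T(o_k).
\[
\delta = \begin{bmatrix} -1 \\ 1/2 \\ 1/2 \end{bmatrix}: \qquad
B\,\delta = 0, \qquad T(o_k)\,\delta = a\,(\beta_k \delta) = 0
\quad\Longrightarrow\quad
\mathcal{I} = \mathrm{Span}\{\delta\}, \qquad \dim \mathcal{I} = 1 = n - d_{\mathcal{M}} .
\]
```

Here $\delta$ has the form of a column of the block matrix in Equation 3.41 with
$F = (1/2, 1/2)^T$, so the first state is dependent on the other two in the sense of
Lemma 3.2.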

Theorem  3.4  Relationship  Between  Minimal  Representations 
Suppose $\mathcal{M} = (S, O, A, B)$ and $\mathcal{M}' = (S', O, A', B')$ are two $n$-state GMMs, both of
which are minimal representations of a class of processes with canonical dimension
$d_{\mathcal{M}}$. Then $\mathcal{M}$ and $\mathcal{M}'$ are related by a change of basis for the $n$-dimensional space of
vectors over the states.

Proof: Since $\mathcal{M}$ and $\mathcal{M}'$ are equivalent models, Lemma 3.1 tells us that there are
two pseudo-stochastic matrices $C$ and $C'$ such that:

$$\forall x \in O^* \cup \{\epsilon\}: \quad B\,T(x)\,C = B'\,T'(x) \tag{3.42}$$

$$\forall x \in O^* \cup \{\epsilon\}: \quad B'\,T'(x)\,C' = B\,T(x) \tag{3.43}$$

Picking $x = \epsilon$, this tells us that $BC = B'$ and $B'C' = B$. Then, substituting
Equation 3.43 back into Equation 3.42, and bringing all terms to the right hand side,
we find that:

$$\forall x \in O^* \cup \{\epsilon\}: \quad B'\,T'(x)\,[I - C'C] = \Sigma'(x)\,[I_{n \times n} - C'C] = 0 \tag{3.44}$$

This means that the corresponding columns of $I_{n \times n}$ and $C'C$ are equivalent priors for
$\mathcal{M}'$. But we know from Theorem 3.3 that priors on minimal models are equivalent
if and only if they are equal. So we conclude that $C'C = I_{n \times n}$. Similarly, we find
that $CC' = I_{n \times n}$ and so we can say that $C'$ and $C$ are non-singular matrices and are
inverses of each other.

Now define the terms state vector space and output vector space to mean the
vector spaces associated with distributions over states and outputs respectively. We
will show that $\mathcal{M}$ is the same as model $\mathcal{M}'$ specified in a different basis for the state
vector space of the model. First of all, suppose $U$ is a vector space, and $S$ is a
non-singular transformation matrix describing a change of basis for $U$. Then the change
of basis is described by the following transformations:

1. Every $x \in U$ is transformed to $Sx$.

2. Every linear operator $O$ which maps $U$ into $U$ is transformed to $SOS^{-1}$.

3. Every linear operator $P$ mapping $U$ into any other vector space is
transformed to $PS^{-1}$.

Now let $S = C'$ and let $S^{-1} = C$. Equation 3.43 tells us that the priors on $\mathcal{M}$ are
mapped onto the priors on $\mathcal{M}'$ by $S$ (i.e., $p' = Sp$). We have already observed that
$B' = BS^{-1}$. Next, consider the equation $BT(y)T(x)S^{-1} = B'T'(y)T'(x)$. Substituting
for $BT(y)$ we find that for every $y \in O^* \cup \{\epsilon\}$ we can write

$$B'\,T'(y)\,S\,T(x)\,S^{-1} = B'\,T'(y)\,T'(x) \tag{3.45}$$

$$B'\,T'(y)\,\bigl[S\,T(x)\,S^{-1} - T'(x)\bigr] = 0 \tag{3.46}$$

This implies that the corresponding columns of $ST(x)S^{-1}$ and $T'(x)$, when
appropriately normalized to sum to 1, would be equivalent priors for $\mathcal{M}'$. So, by Theorem 3.3
they are equal to each other and we can write:

$$T'(x) = S\,T(x)\,S^{-1} \tag{3.47}$$

From this we also know that $\Sigma'(x) = B'T'(x) = BS^{-1}ST(x)S^{-1} = BT(x)S^{-1} =
\Sigma(x)S^{-1}$. Summarizing our conclusions we find that:

$$
\begin{aligned}
p' &= S\,p && (3.48)\\
B' &= B\,S^{-1} && (3.49)\\
T'(x) &= S\,T(x)\,S^{-1} && (3.50)\\
\Sigma'(x) &= \Sigma(x)\,S^{-1} && (3.51)
\end{aligned}
$$

These equations describe transformations that are formally identical to a basis
transformation represented by the matrix $S$. Furthermore, every quantity used to prove
the theorems of this thesis consisted of sums and products of the quantities in
Equations 3.48 to 3.51. So we conclude that equivalent minimal representations are related
by a basis transformation for the state vector space of the models. □
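
The transformation rules of Equations 3.48 to 3.50 can be exercised on a toy model. The
Python sketch below is illustrative only: the model, the matrix $S$ and all function
names are hypothetical, and it assumes $T(o) = A\,\mathrm{diag}(B_o)$ with `B[o][j]` the weight of
output o in state j. It checks that the entries of $B\,T(x)\,p$ are unchanged when the
model is rewritten in the new basis.

```python
from fractions import Fraction as Fr
from itertools import product


def matmul(X, Y):
    """Plain matrix product for lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def change_basis(p, B, T_ops, S, S_inv):
    """Apply Equations 3.48-3.50: p' = S p, B' = B S^-1, T'(o) = S T(o) S^-1."""
    p_new = [row[0] for row in matmul(S, [[x] for x in p])]
    B_new = matmul(B, S_inv)
    T_new = [matmul(matmul(S, T), S_inv) for T in T_ops]
    return p_new, B_new, T_new


def next_output_weights(p, B, T_ops, outputs):
    """Entries of B T(x) p for the string x = outputs (earliest symbol first)."""
    v = [[x] for x in p]
    for o in outputs:
        v = matmul(T_ops[o], v)
    return [row[0] for row in matmul(B, v)]


A = [[Fr(1, 2), Fr(1, 4)], [Fr(1, 2), Fr(3, 4)]]
B = [[Fr(3, 4), Fr(1, 4)], [Fr(1, 4), Fr(3, 4)]]
T_ops = [[[A[i][j] * row[j] for j in range(2)] for i in range(2)] for row in B]
S = [[Fr(2), Fr(-1)], [Fr(-1), Fr(2)]]            # columns sum to one
S_inv = [[Fr(2, 3), Fr(1, 3)], [Fr(1, 3), Fr(2, 3)]]
p = [Fr(1, 3), Fr(2, 3)]
p2, B2, T2 = change_basis(p, B, T_ops, S, S_inv)
same = all(next_output_weights(p, B, T_ops, x) == next_output_weights(p2, B2, T2, x)
           for m in range(4) for x in product(range(2), repeat=m))
print(p2, same)
```

The transformed operators may contain negative entries, which is consistent with the
pseudo-stochastic setting: only the distributions over outputs need to remain stochastic.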

Theorem 3.4 tells us that the minimal representation obtained in Theorem 3.2 is
essentially unique, up to a change of basis for the state vector space. So we have indeed
achieved a satisfactory characterization of the degree of expressiveness in a GMM and
obtained a minimal, canonical representation for the equivalence classes of GMMs.
We will now describe an algorithm that will canonicalize a model by reducing it to
its minimal, canonical representation.

Reduction Algorithm: In order to construct an algorithm to canonicalize GMMs
we will follow the proof of Theorem 3.2. In order to reduce a model $\mathcal{M}$ to its minimal
equivalent form, we need to generate a basis for the span of $U = \{\vec{\sigma}(x) : x \in O^*\}$.
Using the methods developed in our very first algorithm to check equivalence of prior
distributions, we can generate such a basis in O(n^? k) time, where $n$ is the number of
states and $k$ is the number of outputs. Then we use standard Gaussian elimination to
find the linear dependencies amongst the $g_i$ vectors defined in Equation 3.27. This will
take time O(n^? + n^? k) [press90]. The proof of Theorem 3.2 shows that the coefficients
of these linear dependencies are the $\{f_{jl}\}$ required by Lemma 3.2 to reduce the model.
The reduction procedure takes time $O(n^2 + nk)$ since we simply have to set the (at
most) $O(n^2 + nk)$ parameters of the reduced model according to the rules specified
in Lemma 3.2. Therefore, for large $k$ and $n$ we can reduce a GMM to its minimal,
canonical representation in worst-case time O(n^? k).
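
The Gaussian elimination step can be sketched as follows. This Python fragment is an
illustration, not the procedure of [press90]: it row-reduces a matrix $G$ whose rows are
a basis for $\mathrm{Span}(U)$ and reads off, for each dependent column, its expansion in terms of
the independent (pivot) columns. The names `rref` and `column_dependencies` are
hypothetical. Note that row reduction selects the first independent columns rather than
the last ones assumed in Theorem 3.2, so the states may need to be relabelled before
applying Lemma 3.2.

```python
from fractions import Fraction as Fr


def rref(M):
    """Reduced row echelon form with exact arithmetic; returns (matrix, pivot columns)."""
    M = [[Fr(x) for x in row] for row in M]
    pivots, r = [], 0
    for c in range(len(M[0])):
        pr = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pr is None:
            continue
        M[r], M[pr] = M[pr], M[r]
        lead = M[r][c]
        M[r] = [x / lead for x in M[r]]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                f = M[i][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == len(M):
            break
    return M, pivots


def column_dependencies(G):
    """Map each dependent column l to {j: f_jl} with g_l = sum_j f_jl g_j over pivot columns j."""
    R, pivots = rref(G)
    deps = {}
    for l in range(len(G[0])):
        if l not in pivots:
            deps[l] = {j: R[row][l] for row, j in enumerate(pivots)}
    return pivots, deps


# Hypothetical basis rows for a 3-state model with canonical dimension 2.
G = [[Fr(1, 2), Fr(3, 4), Fr(1, 4)],
     [Fr(1, 2), Fr(1, 4), Fr(3, 4)]]
pivots, deps = column_dependencies(G)
print(pivots, deps)
```

For this $G$ the third column equals twice the first minus the second. The coefficients
sum to one but one of them is negative, which is why the reduced model is in general only
pseudo-stochastic.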


3.2.1  Results  for  HMMs 

Hidden Markov Models are derived from the subclass of GMMs with stochastic
transition matrices by restricting the priors to also be stochastic. This restriction on the
priors makes it a little difficult to compare HMMs directly to GMMs. However, we
can make good progress by saying that a GMM $\mathcal{M}$ contains an HMM $\mathcal{N}$ if for every
stochastic prior $p$ on $\mathcal{N}$, we can find an equivalent pseudo-stochastic prior $q$ on $\mathcal{M}$.
In other words, $\mathcal{M}$ contains $\mathcal{N}$ if every process that can be modelled by HMM
$\mathcal{N}$ can also be modelled by GMM $\mathcal{M}$. Now let $\mathcal{N}_G$ denote the GMM derived by
removing the stochasticity restriction on the priors on $\mathcal{N}$. Clearly, if $\mathcal{N}_G \subseteq \mathcal{M}$ then
$\mathcal{M}$ contains $\mathcal{N}$, since $\mathcal{N}_G$ can model every process modelled by $\mathcal{N}$. By definition of
containment, it is also clear that if $\mathcal{M}$ contains $\mathcal{N}$, then all the stochastic priors on
$\mathcal{N}_G$ can be mapped to equivalent priors on $\mathcal{M}$. But, by Lemma 3.1 this means that
$\mathcal{N}_G \subseteq \mathcal{M}$. So we see that GMM $\mathcal{M}$ contains an HMM $\mathcal{N}$ if and only if $\mathcal{N}_G \subseteq \mathcal{M}$
where $\mathcal{N}_G$ is the GMM derived by removing the stochasticity restriction on the priors
on $\mathcal{N}$. We can use this to state the following theorem.


Theorem  3.5  Minimal  Representations  of  HMMs 

Suppose $\mathcal{N}$ is an HMM and $\mathcal{N}^*$ is the smallest HMM equivalent to $\mathcal{N}$. Let $\mathcal{N}_G$ and
$\mathcal{N}^*_G$ denote the GMMs derived by removing the stochasticity constraints on $\mathcal{N}$ and
$\mathcal{N}^*$ respectively. Then every GMM $\mathcal{M}$ that contains $\mathcal{N}$ must satisfy $\mathcal{N}_G \subseteq \mathcal{M}$.
Furthermore, the minimal HMM $\mathcal{N}^*$ has at least as many states as the smallest GMM
equivalent to $\mathcal{N}_G$.

Proof: First of all, suppose that $\mathcal{M}$ is a GMM that contains $\mathcal{N}$. Then we know
that the stochastic priors on $\mathcal{N}_G$ can be transformed into equivalent priors on $\mathcal{M}$. By
Lemma 3.1 we can then conclude that $\mathcal{N}_G \subseteq \mathcal{M}$. Next, by the assumption that $\mathcal{N} \cong \mathcal{N}^*$,
the stochastic priors on $\mathcal{N}_G$ and $\mathcal{N}^*_G$ can be transformed onto equivalent priors on
each other. Therefore, Lemma 3.1 tells us that $\mathcal{N}_G \cong \mathcal{N}^*_G$ also. Now let $\mathcal{M}^*$ be
the smallest GMM equivalent to $\mathcal{N}_G$. Then, since $\mathcal{M}^* \cong \mathcal{N}^*_G$ and $\mathcal{M}^*$ is a minimal
model, we can conclude that $\mathcal{N}^*_G$, and hence $\mathcal{N}^*$, has at least as many states as $\mathcal{M}^*$. □

As a corollary of this theorem we can show that the smallest GMM containing a given
HMM $\mathcal{N}$ is the minimal representation for $\mathcal{N}_G$. This is because we have shown that
every GMM $\mathcal{M}$ containing $\mathcal{N}$ must satisfy $\mathcal{N}_G \subseteq \mathcal{M}$. It is easy to show that this
implies that the canonical dimension of $\mathcal{M}$ must be at least as large as that of $\mathcal{N}_G$.
It can also be shown that if $\mathcal{A}$ and $\mathcal{B}$ are GMMs with the same canonical dimension
and $\mathcal{A} \subseteq \mathcal{B}$, then $\mathcal{A} \cong \mathcal{B}$. Putting these facts together we can see that the minimal
representation of $\mathcal{N}_G$ is the smallest GMM we could possibly pick to contain $\mathcal{N}$.

Theorem 3.5 showed that the minimal HMM representation of a class of processes
will be at least as big as the minimal GMM containing that class. We can also show
that if we insist on having a stochastic interpretation of the parameters of a model,
we may sometimes need many more states than the minimal GMM can achieve. We
can see this as follows. Notice that the space of distributions on outputs spanned by
a $k$-output HMM defines a convex polyhedron on the $(k-1)$-dimensional probability
simplex. The vertices of the polyhedron are defined by the convex hull of the output
distributions on the states. By choosing the priors on the model appropriately we can
explore every corner of the polyhedron. In the worst case, the output distributions
of every state may fall on the convex hull, and so it would be impossible to build
a smaller stochastic model of them. However, if we permit ourselves to use general
linear combinations, we may find that many of the output distributions are linear
combinations of each other, which leads to potential reducibility. This shows that if
our goal is to find parsimonious and easily manipulable representations for stochastic
processes, using GMMs would appear to be a very reasonable course of action.

If we insist on using models with stochastic parameters, it is possible to define a
stochastic canonical dimension of an HMM. This quantity would represent the number
of “basis vectors” we would need if we only used convex combinations in all the places
where we currently use general linear combinations. Analysis of this definition is more
difficult since the “basis vectors” for convex combinations correspond to vertices of
convex polyhedra and the wealth of results concerning bases for linear vector spaces
is not available. However, a brief consideration of the problem suggests that it is very
likely that an HMM can be reduced to a minimal stochastic representation with only
as many states as its stochastic canonical dimension.

We  have  now  concluded  the  major  portion  of  this  thesis.  The  next  chapter  will 
discuss  further  directions  of  research  and  point  out  several  questions  that  were  not 
sufficiently  investigated  in  this  work. 
Chapter 4


Further  Directions  and  Conclusions 


In Chapter 3 we defined Generalized Markov Models, a new class of finite-state
representations for stochastic processes, and saw how the results on equivalence of HMMs
could be extended to GMMs. We used this to define the canonical dimension of a
GMM and developed a complete characterization of the minimal, canonical
representations for these models. We also saw how HMMs are related to GMMs, and observed
that a minimal representation for a stochastic process in the HMM formalism
necessarily has at least as many states as the minimal representation in the GMM model.

One issue that was not thoroughly investigated in this thesis involves characterizing
the class of valid priors on a GMM. Since the definition of GMMs was not
constructive, it is not obvious what the space of valid priors on a model looks like. Hence,
we do not have a characterization of the class of valid transition matrices for GMMs.
One way of trying to understand this issue is to apply the well-worn vector space
techniques of this thesis once again, this time to the task of determining whether a
given pseudo-prior is valid for a model. Similarly, we could determine whether a given
transition matrix is allowable. In addition to the problem of characterizing the valid
priors on a model, there are several other important issues that were not considered
in this thesis. We will discuss these in the sections below.
4.1 Reduction of Initialized HMMs

In  Chapter  3  we  addressed  the  problem  of  reducing  GMMs  to  minimal  canonical 
forms.  We  saw  that  a  class  of  processes  with  canonical  dimension  d  needed  at  least  d 
states  in  its  GMM  representation.  In  many  applications,  after  the  stage  of  training 
parameters  for  a  model  is  completed,  we  will  not  actually  need  the  freedom  of  being 
able  to  set  different  prior  distributions  on  the  model.  In  other  words,  we  will  actually 
be  dealing  with  an  Initialized  GMM.  Since  the  model  now  represents  a  single  process, 
it may be possible to reduce the number of states still further.^1 So we should consider
how to reduce an Initialized GMM (M, p) to a minimal representation (M', p') such
that (M, p) <=> (M', p') and M' has as few states as possible.
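
The looping-state example given in the footnote to this section can be checked numerically. The sketch below is illustrative only. It assumes a convention in which a state emits a symbol and then transitions, with column-stochastic matrices `A` and `B`; the function `sequence_probability` and all parameter values are hypothetical. It compares a three-state model initialized on its looping state with a one-state model on every string up to length four, which is a finite check rather than a proof of equivalence.

```python
import itertools
import numpy as np

def sequence_probability(A, B, p, x):
    """Pr(x | M, p), assuming a state emits a symbol and then transitions.

    A[j, i] = Pr(next state j | current state i)   (columns sum to 1)
    B[o, i] = Pr(output o | state i)                (columns sum to 1)
    """
    v = np.asarray(p, dtype=float)
    for o in x:
        v = A @ (B[o] * v)
    return float(v.sum())

# Hypothetical 3-state, 2-output model; state 0 loops on itself with probability 1.
A = np.array([[1.0, 0.3, 0.2],
              [0.0, 0.5, 0.3],
              [0.0, 0.2, 0.5]])
B = np.array([[0.7, 0.1, 0.4],
              [0.3, 0.9, 0.6]])
p = np.array([1.0, 0.0, 0.0])   # all prior mass on the looping state

# Candidate reduced model: keep only the looping state.
A1 = np.array([[1.0]])
B1 = B[:, [0]]
p1 = np.array([1.0])

# Largest disagreement over all output strings of length 0 through 4.
max_gap = max(
    abs(sequence_probability(A, B, p, x) - sequence_probability(A1, B1, p1, x))
    for n in range(5)
    for x in itertools.product(range(2), repeat=n)
)
print(max_gap)
```

With a different prior, for example one that places mass on states 1 and 2, the one-state model no longer matches, which is why this reduction applies to the initialized model only.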


4.2  Reduction  While  Preserving  Paths 

In some pattern recognition applications of Hidden Markov Models the maximum
likelihood path producing an output sequence x is as important as the probability
that x is produced. In such cases, we will be faced with two new issues that were not
addressed in this thesis. First of all, we will have to give meaning to a “maximum
likelihood path” in a Generalized Markov Model. Secondly, we will have to find a
method of model reduction that preserves enough information about paths to recover
the identity of paths in the original model from paths in the reduced model. There
are some applications in which we are only interested in passage through some small
number of states rather than the entire path. In such situations, the simplest way of
achieving reduction while preserving paths would be to declare the appropriate states
to be irreducible. Such states would never be merged with others in the reduction
algorithm and so their identity would be preserved in the reduced model.

^1 For example, suppose a GMM has one state that loops on itself with probability 1, and we
initialize the model with all the mass on the looping state. Then, once we have fixed this prior, all
the other states are clearly unnecessary.


4.3  Training  GMMs 

When we defined Generalized Markov Models in Chapter 3 we made no mention of
training algorithms for these models. This was partly because the class of valid
transition matrices and priors was not characterized, and this makes it difficult to evaluate
whether a given set of parameters induces valid stochastic processes. Nonetheless,
there are some options that come to mind immediately. First of all, we could use
corrective training methods, such as gradient descent to minimize a squared error
measure. (Niles [niles90] suggests such a procedure in the context of his HMM-net.)
Furthermore, despite their exotic underlying chains, GMMs still define true
probability distributions on their output sequences. Consequently, it still makes sense to think
about Maximum Likelihood methods where we would attempt to set the parameters
of the model to maximize the likelihood of a database of examples. The easiest way
to derive a method for updating the parameters would be to follow the derivation
of Levinson et al., who treat Maximum Likelihood Estimation in the framework of
classical constrained optimization [levinson83].

4.4  Approximate  Equivalence 

Although the results of the previous chapter are a complete characterisation of
equivalence and reduction of GMMs, they can be a little unsatisfying, as the following
example shows. Suppose M is a model whose transition amplitudes are all equal and
all of whose output distributions are linearly independent of each other. According to
our results, this model is not reducible because there is no degeneracy in the output
distributions. Indeed, it is true that we cannot build a smaller model that agrees with
M at all times. This is because it is always possible to pick priors in such a way as to
explore the entire valid span of the output distributions of M, while a smaller model
could not span a space of the same dimension. Yet, it is clear that after the first
output, the distribution over states will be uniform and the probability of emitting
various outputs will always be unchanged. We might like to ignore the first output,
and say that M can be reduced to a single state, since what happens in the beginning
is an artifact of the prior. Minor modifications to our results would accommodate this:
whenever a theorem evaluated a condition for every x ∈ O* ∪ {ε}, we would instead
evaluate the condition for {x : |x| > 1}. We would also need to appropriately modify
the vector spaces whose properties were checked in various algorithms developed in
this thesis.
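
The behaviour described in this example can be reproduced with a small numerical sketch. It is an illustration under stated assumptions rather than part of the results: it uses an ordinary HMM with all transition probabilities equal to 1/n, assumes that a state emits a symbol and then transitions, and uses hypothetical values for `B` and the priors. The first output depends on the prior, while every later output follows the same distribution, namely that of a single state whose output distribution is the average of the columns of `B`.

```python
import numpy as np

def forward(A, B, p, x):
    """Joint weights Pr(state, x), assuming a state emits and then transitions."""
    v = np.asarray(p, dtype=float)
    for o in x:
        v = A @ (B[o] * v)
    return v

n = 3
A = np.full((n, n), 1.0 / n)           # all transition probabilities equal
B = np.array([[0.7, 0.2, 0.1],         # B[o, i] = Pr(output o | state i);
              [0.2, 0.7, 0.2],         # the columns are linearly independent
              [0.1, 0.1, 0.7]])

priors = [np.array([1.0, 0.0, 0.0]), np.array([0.2, 0.3, 0.5])]

# Distribution of the first output: depends on the prior.
first_output = [B @ p for p in priors]

def next_output_distribution(p, x):
    """Distribution of the next output, conditioned on having emitted x."""
    v = forward(A, B, p, x)
    return B @ (v / v.sum())

# Distribution of the second output, for every prior and every first symbol.
later = [next_output_distribution(p, (o,)) for p in priors for o in range(3)]

# Output distribution of the candidate single-state model.
single_state_output = B.mean(axis=1)

print(first_output)
print(later[0], single_state_output)
```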

However, this brings up the more general question of approximation algorithms.
Often, we may not care about what happens at early or late times. Or we may not
care if the statistics defined by two models are exactly the same so long as they are
close. Approximate equivalence in this sense of “closeness” of models is particularly
important because the parameters of probabilistic models are usually estimated from
data. Consequently, exact equivalence will be a rare event. Equivalence ignoring late
or early times can be easily handled within our methods by various slight
modifications of our results. The interesting and difficult problem is to define “closeness” of
stochastic processes appropriately and to prove under what conditions two GMMs
are “close” under the definition.


4.5  Practical  Applications 

Finally, we should mention the possible practical applications of this work,
particularly since it was originally begun in the context of building better practical classifiers.
Statistical methods and models are being increasingly used in pattern recognition and
other fields. The models built in some applications can be very large ([kupiec90]) and
reducing them to equivalent models of smaller sizes would be computationally useful.
However, since the parameters of models are typically estimated from data, they will
very rarely be exactly reducible and the approximation algorithms mentioned in the
previous section will be crucial. Since we do not currently have a provably good
algorithm for approximate reduction, a reasonable preliminary course to take would be
to substitute tests for linear dependence with tests for "almost" linear dependence in
all the algorithms and results of the previous chapters. Of course, it is also possible
to simply build a smaller model and retrain it from data rather than reducing
a larger model. However, if the model would take a long time to train (e.g., if the
database of examples is very large), or if the large model was constructed for human
readability and manual fine tuning ([kupiec90]), reduction of a large model would be
a better course of action. Even if we prefer to retrain a smaller model, the canonical
dimension defined in Chapter 3 could be evaluated as a way of testing whether a
smaller model should be built and retrained. The reduction algorithm could also be
used as a way of finding the structure of a good smaller model that is equivalent or
nearly equivalent to the original. Even if we retrain the parameters of the reduced
model, the reduction step would tell us how many states we are likely to need to get
a good representation of the statistics modelled by the larger model.
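
One possible reading of a test for "almost" linear dependence is a numerical rank computed from singular values with a tolerance. The sketch below is an assumption offered for illustration, not a provably good procedure: the function `numerical_rank`, the relative threshold `tol`, and the example vectors are all hypothetical. It shows three vectors that are exactly independent but whose third member is within about 1e-4 of a combination of the first two.

```python
import numpy as np

def numerical_rank(vectors, tol):
    """Count singular values larger than tol times the largest singular value.

    With a tiny tol this is an (almost) exact rank test; with a larger tol,
    directions carrying very little weight are treated as linearly dependent.
    """
    s = np.linalg.svd(np.asarray(vectors, dtype=float), compute_uv=False)
    if s[0] == 0.0:
        return 0
    return int(np.sum(s > tol * s[0]))

# Hypothetical estimated output distributions (one per row). The third row is
# the average of the first two, perturbed by 1e-4 as estimation noise might do.
D = np.array([
    [0.7,    0.2,    0.1],
    [0.1,    0.3,    0.6],
    [0.4001, 0.2499, 0.35],
])

exact_rank = numerical_rank(D, tol=1e-12)   # treats the noise as real structure
loose_rank = numerical_rank(D, tol=1e-2)    # treats the noise as dependence

print(exact_rank, loose_rank)
```

How large the tolerance should be depends on how accurately the parameters were estimated, which is the open question of "closeness" raised in the previous section.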

Another potential practical application involves the implementation and
evaluation of GMMs as pattern classifiers. There is some reason to suspect that given a
GMM and an HMM with n states each, the GMM could perform better as a pattern
classifier. This is plausible because, given a fixed number of states, a GMM can model
a wider class of processes than an HMM. In practical applications we are typically
dealing with the problem of approximating stochastic sequences. There may be
processes modelled by n-state GMMs that are much closer to the true process than the
best approximation we can find in the HMM formalism. In order to understand this
question from the theoretical viewpoint we would need to make progress along several
fronts including understanding the approximation properties of HMMs. For example,
we would need to be able to compare how accurately a given stationary process can
be represented by HMMs and GMMs with n states each. This is a difficult problem
worthy of being studied. From the point of view of practical applications, the question
of the usefulness of GMMs is best resolved empirically in the domain of application.


4.6  Conclusion 

This thesis arose from an attempt to build part of a good foundation for pattern
recognition using Hidden Markov Models. There is a need for analytical tools that
will enable us to compare different formalisms for pattern recognition and
to predict their relative effectiveness. In this thesis, we have proved several theorems
that uncover the source of the intrinsic expressiveness of Hidden Markov Models. We
have shown how to detect equivalence of prior distributions on a model and given
a geometric characterization of equivalent priors. This led to a characterization of
equivalent Initialized Hidden Markov Models and then of equivalent HMMs. We have
given theorems that detect these equivalencies in polynomial time. Next, empirical
and theoretical motivations led us to define the class of Generalized Markov Models,
which contains HMMs as a subclass. We used the definition to reduce HMMs and
GMMs to minimal, canonical representations which remove all redundancy from a
model. We also developed a geometric characterization of the minimal representations
that gave insight into the source of the expressiveness of GMMs and HMMs. This
characterization also led to a polynomial time reduction algorithm for Generalized
Markov Models. These results lay part of a foundation for the principled use of
finite-state models of stochastic processes in pattern recognition.


Appendix  A 


HMMs  and  Probabilistic  Automata 


There have been some recent results in the theory of Probabilistic Automata (PAs)
that use methods very similar to ours to decide the equivalence of PAs in
polynomial time [tzeng]. Tzeng’s work also discusses a result on approximate equivalence of
PAs that may provide leads on ways to proceed towards understanding approximate
equivalence of HMMs and GMMs. In this appendix we will show how HMMs and PAs
are related. First of all, we will define Probabilistic Automata in Tzeng’s formulation.

Definition A.1 Probabilistic Automata

Let $\mathcal{M}(i,j)$ denote the set of all $i \times j$ stochastic matrices. A Probabilistic Automaton
$U$ is a 5-tuple $(S, \Sigma, M, p, F)$, where $S = \{s_1, s_2, \ldots, s_n\}$ is a finite set of states, $\Sigma$ is
an input alphabet, $M$ is a function from $\Sigma$ into $\mathcal{M}(n,n)$, $p$ is a prior distribution on
the states, and $F \subseteq S$ is a finite set of final states. $M(\sigma)_{ji}$ is the probability that
$U$ moves from state $s_i$ to $s_j$ after reading the symbol $\sigma \in \Sigma$. We say that $x \in \Sigma^*$ ^1
is accepted by $U$ with probability $p_U(x)$ if $U$ ends up in a final state with probability
$p_U(x)$ on reading $x$.

It is clear from this definition that PAs are closely related to HMMs. The matrices
$M(\sigma)$ are similar in meaning to HMM transition operators and, barring the complication
of the final states, the rest of the model is similar also. We will show that with an
appropriate definition of equivalence, Initialized HMMs are a subclass of Probabilistic
Automata.

^1 We are using the standard notation that $\Sigma^*$ is the set of all finite-length strings that can be
produced by concatenating symbols in $\Sigma$ together.

Definition A.2 Equivalence of PAs and HMMs

Let $U$ be a Probabilistic Automaton and let $(\mathcal{M}, p)$ be a Hidden Markov Model with
$O = \Sigma$, where $O$ is the output set of $\mathcal{M}$, and $\Sigma$ is the input alphabet of $U$. We will
say that $U$ and $\mathcal{M}$ are equivalent ($U \Leftrightarrow (\mathcal{M}, p)$) if $\Pr(x \mid \mathcal{M}, p) = p_U(x)$ for every
$x \in O^* \cup \{\epsilon\}$.

Definition A.2 says that a Probabilistic Automaton $U$ is equivalent to a Hidden
Markov Model $\mathcal{M}$ when $U$ accepts strings with the same probability that $\mathcal{M}$ emits
them. We will now show that under this definition of equivalence, HMMs are
contained in the class of Probabilistic Automata.

Theorem A.1 HMMs ⊂ PAs

The class of Hidden Markov Models is contained in the class of Probabilistic Automata
when equivalence is defined by Definition A.2.

The basic idea of the proof is shown in Figure A-1. Given an HMM $\mathcal{M}$, we will build
a Probabilistic Automaton $U$ with the same states plus one extra state to leak away
excess probabilities. We will then set up the matrices $M(\sigma)_{ji}$ to mimic the action
of the HMM transition operators on the states that the two models share. If every
one of the shared states is a final state of $U$, this will guarantee that $\mathcal{M}$ and $U$ are
equivalent.

Proof: Suppose $\mathcal{M} = (S, O, A, B)$ is any HMM with prior $p$. Construct a PA
$U = (S_U, \Sigma, M, p', F)$ where:

$$S_U = S \cup \{s_U\} \tag{A.1}$$

$$\Sigma = O \tag{A.2}$$

$$F = S \tag{A.3}$$

$$p'_i = \begin{cases} p_i & \text{if } s_i \in S \\ 0 & \text{otherwise} \end{cases} \tag{A.4}$$

$$M(\sigma)_{ji} = \begin{cases}
B_{\sigma i} A_{ji} & \text{if } s_i, s_j \in S \\
1 - \sum_{s_k \in S} B_{\sigma i} A_{ki} & \text{if } s_i \in S,\ s_j = s_U \\
0 & \text{if } s_i = s_U,\ s_j \in S \\
1 & \text{if } s_i = s_j = s_U
\end{cases} \tag{A.5}$$

We will prove by induction on the length of strings $x$ that $p_U(x) = \Pr(x \mid \mathcal{M}, p)$
for every string $x$. First of all, both models accept the null string with probability 1,
since both start with all the mass of a stochastic vector on the states $S$. Furthermore,
$\Pr(s_i, \epsilon \mid U) = \Pr(s_i, \epsilon \mid \mathcal{M}, p)$. Now suppose that for every string $x$ such that $|x| < t$,
it is true that $p_U(x) = \Pr(x \mid \mathcal{M}, p)$ and that for every state $s_i \in S$, $\Pr(s_i, x \mid U) =
\Pr(s_i, x \mid \mathcal{M}, p)$. Then for any symbol $\sigma \in \Sigma = O$, and every state $s_j \in S$, it will be true
that:

$$
\begin{aligned}
\Pr(s_j, x\sigma \mid U) &= \sum_{i=1}^{|S|} B_{\sigma i} A_{ji} \Pr(s_i, x \mid U) + \Pr(s_U, x \mid U)\, M(\sigma)_{jU} & \text{(A.6)} \\
&= \sum_{i=1}^{|S|} B_{\sigma i} A_{ji} \Pr(s_i, x \mid \mathcal{M}, p) & \text{(A.7)} \\
&= \Pr(s_j, x\sigma \mid \mathcal{M}, p) & \text{(A.8)}
\end{aligned}
$$

Furthermore, since the accepting states of $U$ are exactly the states in $S$, it also
follows that $p_U(x\sigma) = \sum_{s_j \in S} \Pr(s_j, x\sigma \mid U) = \Pr(x\sigma \mid \mathcal{M}, p)$. By induction on $t = |x|$,
we can conclude that for every string $x$, the probability that $x$ is produced by $\mathcal{M}$ is
equal to the probability that $x$ is accepted by $U$. Hence, $(\mathcal{M}, p) \Leftrightarrow U$ and, therefore,
HMMs ⊂ PAs.

On the other hand, it is easy to show that there are probabilistic
automata that cannot be implemented as Hidden Markov Models. For example,
define the support of a PA to be the set of strings that are accepted with non-zero
probability. The support of an HMM would be the set of strings that are emitted with
non-zero probability. Because of the definitions of the models, a PA could have a finite
support, or even a support that consists only of strings longer than a fixed length.
Neither of these two cases is possible for an HMM. Furthermore, the various $M(\sigma)$
matrices of a PA need not bear any relationship to each other, while the
corresponding transition operators of HMMs are closely related to each other, via the $A$ and
$B$ matrices. For this reason also it is possible to define PAs that are not equivalent
to any Hidden Markov Model in our formulation of equivalence. So we conclude that
HMMs ⊂ PAs. □
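
The construction used in the proof can be sketched in code and checked on short strings. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes that a state emits a symbol and then transitions, that `A` and `B` are column-stochastic, and that the extra state is stored at the last index. The names `hmm_to_pa`, `hmm_probability`, `pa_acceptance` and the parameter values are hypothetical.

```python
import itertools
import numpy as np

def hmm_probability(A, B, p, x):
    """Pr(x | M, p), assuming a state emits a symbol and then transitions."""
    v = np.asarray(p, dtype=float)
    for o in x:
        v = A @ (B[o] * v)
    return float(v.sum())

def hmm_to_pa(A, B, p):
    """Build PA matrices with one extra absorbing state s_U at the last index."""
    n, k = A.shape[0], B.shape[0]
    M = np.zeros((k, n + 1, n + 1))
    for o in range(k):
        M[o, :n, :n] = A * B[o]                      # M(o)[j, i] = B[o, i] * A[j, i]
        M[o, n, :n] = 1.0 - M[o, :n, :n].sum(axis=0) # excess mass leaks into s_U
        M[o, n, n] = 1.0                             # s_U never returns any mass
    p_U = np.append(np.asarray(p, dtype=float), 0.0) # no prior mass on s_U
    final = np.append(np.ones(n), 0.0)               # final states are the shared ones
    return M, p_U, final

def pa_acceptance(M, p_U, final, x):
    """Probability that the PA ends in a final state after reading x."""
    v = np.array(p_U, dtype=float)
    for o in x:
        v = M[o] @ v
    return float(final @ v)

# Hypothetical 2-state, 2-output HMM.
A = np.array([[0.9, 0.4],
              [0.1, 0.6]])
B = np.array([[0.8, 0.3],
              [0.2, 0.7]])
p = np.array([0.5, 0.5])

M, p_U, final = hmm_to_pa(A, B, p)

# Largest disagreement over all strings of length 0 through 4.
max_gap = max(
    abs(hmm_probability(A, B, p, x) - pa_acceptance(M, p_U, final, x))
    for n in range(5)
    for x in itertools.product(range(2), repeat=n)
)
print(max_gap)
```

Each matrix `M[o]` built this way has columns that sum to 1, as the definition of a PA requires, because the last row absorbs whatever probability the symbol `o` does not account for.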


Given  an  HMM  M,  we  can  construct  a  Probabilistic  Automaton 
U  that  is  equivalent  to  M  by  copying  over  the  structure  of  M  and 
adding  one  extra  state  to  soak  up  excess  probabilities.  (See  text 
for  discussion.) 


Figure A-1: Constructing a Probabilistic Automaton Equivalent to an HMM


Theorem A.1 shows that HMMs can be considered a subclass of Probabilistic
Automata. However, if the number of outputs is large, an n-state PA will require
many more parameters to describe it than an n-state HMM. W. Tzeng proves results
concerning equivalence of Probabilistic Automata using methods that are similar to
ours [tzeng]. He also discusses the problem of approximate equivalence of PAs and
arrives at some interesting results. Since HMMs and PAs are so closely related, the
methods used by Tzeng to extract results concerning approximate equivalence can
guide us in studying the same question for HMMs.